Abstract: We consider the tradeoff between the rate and the
block-length for a fixed error probability when we use polar codes
and the successive cancellation decoder. The "scaling" between
these two parameters gives interesting engineering insights, and
in particular tells us how fast one can approach capacity if our
goal is to achieve a fixed block-error probability.

Noticing that for polar codes the exact scaling behavior
depends strongly on the choice of the channel, our objective
is to provide scaling laws that hold universally for all the BMS
channels. Our approach is based on analyzing the dynamics of the
un-polarized channels. More precisely, we provide bounds on (the
exponent of) the number of sub-channels whose Bhattacharyya
constant falls in a fixed interval $[a,b]$. Mathematically, this can
be stated as bounding the sequence
$\{\frac{1}{n}\log\Pr(Z_n \in [a,b])\}_{n\in\mathbb{N}}$,
where $Z_n$ is the Bhattacharyya process. We then use these bounds
to derive trade-offs between the rate and the block-length.

The main results of this paper can be summarized as follows.
Let $W$ be a BMS channel with capacity $I(W)$. Consider the
sum of Bhattacharyya parameters of sub-channels chosen (by
the polar coding scheme) to transmit information. If we require
this sum to be smaller than a given value $P_e > 0$, then the
required block-length $N$ scales in terms of the rate $R < I(W)$
as $N \geq \frac{\alpha}{(I(W)-R)^{\underline{\mu}}}$, where $\alpha$ is a
positive constant that depends on $P_e$ and $I(W)$. We show that
$\underline{\mu} = 3.55$ is a valid choice, and we conjecture that
indeed the value of $\underline{\mu}$ can be improved to
$\underline{\mu} = 3.627$, the parameter for the binary erasure channel.
Also, we show that with the same requirement on the sum of
Bhattacharyya parameters, the block-length scales in terms of
the rate like $N \leq \frac{\beta}{(I(W)-R)^{\overline{\mu}}}$, where
$\beta$ is a constant that depends on $P_e$ and $I(W)$, and
$\overline{\mu} = 7$.



I. Introduction 

Polar coding schemes [1] provably achieve the capacity 
of a wide array of channels including binary memoryless 
symmetric (BMS) channels. 

In coding, the three most important parameters are: rate ($R$),
block-length ($N$), and block error probability ($P_e$). Ideally,
given a family of codes such as the family of polar codes,
one would like to be able to describe the exact relationship
between these three parameters. This however is a formidable
task. Slightly easier is to fix one of the parameters and then
to describe the relationship (scaling) of the remaining two.

For example, assume that we fix the rate and consider the
relationship between the error probability and the block-length.
This is the study of the classical error exponent. For instance, for
random codes a closer look shows that

$$P_e = e^{-N E(R,W) + o(N)},$$

where $E(R,W)$ is the so-called random error exponent [2] of
the channel $W$. For polar codes, Arıkan and Telatar [3] showed
that when $W$ is a BMS channel, for any rate $R < I(W)$ the
block error probability is upper bounded by $2^{-N^{\beta}}$ for any
$\beta < \frac{1}{2}$ and $N$ large enough. This result was refined later in [4]
to be dependent on $R$, i.e., for polar codes with the successive
cancellation (SC) decoder

$$P_e = 2^{-2^{\frac{n}{2} + \sqrt{n}\, Q^{-1}\left(\frac{R}{I(W)}\right) + o(\sqrt{n})}},$$

where¹ $n = \log N$ and $Q(t) = \int_t^{\infty} e^{-z^2/2}\, dz / \sqrt{2\pi}$.

Another option is to fix the error probability and to consider
the relationship between the block-length and the rate. In other
words, given a code and a desired (and fixed) error probability
$P_e$, what is the block-length $N$ required, in terms of the rate
$R$, so that the code has error probability less than $P_e$? This
scaling is arguably more relevant (than the error exponent)
from a practical point of view since we typically have a certain
requirement on the error probability and then are interested in
using the shortest code possible to transmit at a certain rate.

As a benchmark, let us mention the shortest block-length
that we can hope for. Some thought clarifies that
the random variations of the channel itself require
$R \leq I(W) - \Theta(\frac{1}{\sqrt{N}})$ or equivalently
$N \geq \Theta(\frac{1}{(I(W)-R)^2})$. Indeed,
a sequence of works starting from [5], then [6], and finally [7]
showed that the minimum possible block-length $N$ required to
achieve a rate $R$ with a fixed error probability $P_e$ is roughly
equal to

$$N \approx V \left(\frac{Q^{-1}(P_e)}{I(W)-R}\right)^2, \tag{1}$$

where $V$ is a characteristic of the channel referred to as
channel dispersion. In other words, the best codes require a
block-length of order $\Theta(\frac{V}{(I(W)-R)^2})$.

The main objective of this paper is to characterize similar
relations for polar codes with the SC decoder. We argue
in this paper that this problem is fundamentally related to the
dynamics of channel polarization and especially the speed at
which the polarization takes place. To state things in a
more convenient language, let us start with some preliminary
definitions and settings regarding polarization and polar codes.

A. Preliminaries

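Let $W : \mathcal{X} \to \mathcal{Y}$ be a BMS channel, with input alphabet
$\mathcal{X} = \{0, 1\}$, output alphabet $\mathcal{Y}$, and the transition probabilities
$\{W(y|x) : x \in \mathcal{X}, y \in \mathcal{Y}\}$. We consider the following three
parameters for the channel $W$,

$$H(W) = \sum_{y \in \mathcal{Y}} W(y|1) \log \frac{W(y|1) + W(y|0)}{W(y|1)}, \tag{2}$$

$$Z(W) = \sum_{y \in \mathcal{Y}} \sqrt{W(y|0) W(y|1)}, \tag{3}$$

$$E(W) = \frac{1}{2} \sum_{y \in \mathcal{Y}} W(y|1)\, e^{-\frac{1}{2}\left(\ln \frac{W(y|1)}{W(y|0)} + \left|\ln \frac{W(y|1)}{W(y|0)}\right|\right)}. \tag{4}$$

¹In this paper all the logarithms are in base 2.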



The parameter $H(W)$ is equal to the entropy of the input
of $W$ given its output when we assume uniform distribution
on the inputs, i.e., $H(W) = H(X|Y)$. Hence, we call the
parameter $H(W)$ the entropy of the channel $W$. Also note
that the capacity of $W$, which we denote by $I(W)$, is given
by $I(W) = 1 - H(W)$. The parameter $Z(W)$ is called the
Bhattacharyya parameter of $W$ and $E(W)$ is called the error
probability of $W$. It can be shown that $E(W)$ is equal to the
error probability in estimating the channel input $x$ on the basis
of the channel output $y$ via the maximum-likelihood decoding
of $W(y|x)$ (with the further assumption that the input has
uniform distribution). The following relations hold between
these parameters (see, e.g., [1] and [13, Chapter 4]):

$$0 \leq 2E(W) \leq H(W) \leq Z(W) \leq 1, \tag{5}$$

$$H(W) \leq h_2(E(W)), \tag{6}$$

$$Z(W) \leq \sqrt{1 - (1 - H(W))^2}, \tag{7}$$

where $h_2(\cdot)$ denotes the binary entropy function, i.e.,

$$h_2(x) = -x \log_2(x) - (1 - x) \log_2(1 - x). \tag{8}$$



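B. Channel transform

Let $\mathcal{W}$ denote the set of all the BMS channels and consider
a transform $W \to (W^0, W^1)$ that maps $\mathcal{W}$ to $\mathcal{W}^2$ in the
following manner. Having the channel $W : \{0, 1\} \to \mathcal{Y}$, the
channels $W^0 : \{0, 1\} \to \mathcal{Y}^2$ and $W^1 : \{0, 1\} \to \{0, 1\} \times \mathcal{Y}^2$ are
defined as

$$W^0(y_1, y_2 | x_1) = \sum_{x_2 \in \{0,1\}} \frac{1}{2} W(y_1 | x_1 \oplus x_2) W(y_2 | x_2), \tag{9}$$

$$W^1(y_1, y_2, x_1 | x_2) = \frac{1}{2} W(y_1 | x_1 \oplus x_2) W(y_2 | x_2). \tag{10}$$

A direct consequence of the chain rule of entropy yields

$$\frac{H(W^0) + H(W^1)}{2} = H(W). \tag{11}$$

Regarding the other parameters, we have

$$Z(W) \leq Z(W^0) \leq 1 - (1 - Z(W))^2, \tag{12}$$
$$Z(W^1) = Z(W)^2, \tag{13}$$

and

$$E(W^0) = 1 - (1 - E(W))^2, \tag{14}$$
$$E(W)^2 \leq E(W^1) \leq E(W). \tag{15}$$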



C. Channel Polarization

Consider an infinite binary tree with the root node placed at
the top. In this tree each vertex has 2 children and there are $2^n$
vertices at level $n$. Assume that we label these vertices from
left to right from 0 to $2^n - 1$. Here, we intend to assign to each
vertex of the tree a BMS channel. We do this by a recursive
procedure. Assign to the root node the channel $W$ itself. Now
consider the channel splitting transform $W \to (W^0, W^1)$ and,
from left to right, assign $W^0$ and $W^1$ to the children of the root
node. In general, if $Q$ is the channel that is assigned to vertex
$v$, we assign $Q^0$ and $Q^1$, from left to right respectively, to the
children of the node $v$. In this way, we recursively assign a
channel to all the vertices of the tree. Figure 1 shows the first
2 levels of the binary tree. Assuming $N = 2^n$, we let $W_N^{(i)}$
denote the channel that is assigned to a vertex with label $i$
at level $n$ of the tree, $0 \leq i \leq N - 1$. As a result, one can
equivalently relate the channel $W_N^{(i)}$ to $W$ via the following
procedure: let the binary representation of $i$ be $b_1 b_2 \cdots b_n$, where
$b_1$ is the most significant digit. Then we have

$$W_N^{(i)} = (((W^{b_1})^{b_2})^{\cdots})^{b_n}.$$

As an example, assuming $i = 6$, $n = 3$ we have
$W_8^{(6)} = ((W^1)^1)^0$. We now proceed with defining a stochastic process
called the polarization process. This process can be considered
as a stochastic representation of the channels associated to
different levels of the infinite binary tree.

Fig. 1. The infinite $\ell$-ary tree and the channels assigned to it for $\ell = 2$.



D. Polarization Process

Let $\{B_n, n \geq 1\}$ be a sequence of iid Bernoulli($\frac{1}{2}$) random
variables. Denote by $(\mathcal{F}, \Omega, \Pr)$ the probability space generated
by this sequence and let $(\mathcal{F}_n, \Omega_n, \Pr_n)$ be the probability
space generated by $(B_1, \cdots, B_n)$. For a BMS channel $W$, define
a random sequence of channels $W_n$, $n \in \mathbb{N} = \{0, 1, 2, \cdots\}$,
as $W_0 = W$ and

$$W_n = \begin{cases} W_{n-1}^0 & \text{if } B_n = 0, \\ W_{n-1}^1 & \text{if } B_n = 1, \end{cases} \tag{16}$$

where the channels on the right side are given by the
transform $W_{n-1} \to (W_{n-1}^0, W_{n-1}^1)$. Let us also define the
random processes $\{H_n\}_{n\in\mathbb{N}}$, $\{I_n\}_{n\in\mathbb{N}}$, $\{Z_n\}_{n\in\mathbb{N}}$ and $\{E_n\}_{n\in\mathbb{N}}$
as $H_n = H(W_n)$, $I_n = I(W_n) = 1 - H(W_n)$, $Z_n = Z(W_n)$
and $E_n = E(W_n)$.

Example 1: By a straightforward calculation one can show
that for $W = \text{BEC}(z)$ we have

$$W^0 = \text{BEC}(1 - (1 - z)^2), \tag{17}$$
$$W^1 = \text{BEC}(z^2). \tag{18}$$

Hence, when $W = \text{BEC}(z)$, the channel $W_n$ is always a BEC.
Furthermore, the processes $H_n$, $I_n$, $Z_n$ and $E_n$ admit simple
closed form recursions as follows. We have $H_0 = z$ and for
$n \geq 1$

$$H_n = \begin{cases} 1 - (1 - H_{n-1})^2 & \text{w.p. } \frac{1}{2}, \\ H_{n-1}^2 & \text{w.p. } \frac{1}{2}. \end{cases} \tag{19}$$

Also, we have² $2E_n = H_n = 1 - I_n = Z_n$.

²For the channel $W = \text{BEC}(z)$, it is easy to show that $2E(W) = H(W) =
Z(W) = z$.
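
The BEC recursions of Example 1 are easy to check numerically. The following Python sketch enumerates the erasure probabilities of all $2^n$ channels at level $n$ of the tree, ordered by the label $i$, and lets one verify that their average stays equal to $z$, as the martingale property of $H_n$ suggests. The helper names `bec_children` and `bec_level` are illustrative assumptions and do not come from the paper.

```python
def bec_children(z):
    """Erasure probabilities of (W^0, W^1) for W = BEC(z), following (17)-(18)."""
    return 1.0 - (1.0 - z) ** 2, z * z


def bec_level(z, n):
    """Erasure probabilities of the 2**n channels at level n, left to right.

    Position i in the returned list corresponds to the label i whose binary
    expansion b_1 ... b_n (b_1 most significant) selects the transforms.
    """
    level = [z]
    for _ in range(n):
        nxt = []
        for q in level:
            nxt.extend(bec_children(q))  # left child W^0, right child W^1
        level = nxt
    return level


if __name__ == "__main__":
    values = bec_level(0.5, 10)
    # The empirical mean of H_n over all 2**n equally likely paths.
    print(len(values), sum(values) / len(values))
```

For instance, position 6 of `bec_level(z, 3)` is the erasure probability of $((W^1)^1)^0$, in line with the example for $i = 6$, $n = 3$ given in Section I-C.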



For channels other than the BEC, the channel $W_n$ gets quite
complicated in the sense that the cardinality of the output
alphabet of the channel $W_n$ is doubly exponential in $n$ (or
exponential in $N$). Thus, tracking the exact outcome of $W_n$
seems to be a difficult task (for more detail see [15], [16]).
Instead, as we will see in the sequel, one can prove many
interesting properties regarding the processes $H_n$, $Z_n$ and $E_n$.
Let us quickly review the limiting properties of the above
mentioned processes [1], [3]. From (11) and (16), one can
write for $n \geq 1$

$$\mathbb{E}[H(W_n) \mid W_{n-1}] \overset{(16)}{=} \frac{H(W_{n-1}^0) + H(W_{n-1}^1)}{2} \overset{(11)}{=} H(W_{n-1}). \tag{20}$$

Hence, the process $H_n$ is a martingale. Furthermore, since
$H_n$ is also bounded (5), by Doob's martingale convergence
theorems, the process $H_n$ converges in $\mathcal{L}^1$ (and almost surely)
to a limit random variable $H_\infty$. As the convergence is in $\mathcal{L}^1$,
as $n \to \infty$ we have

$$\mathbb{E}[|H_n - H_{n-1}|] = \mathbb{E}[|H(W_{n-1}^0) - H(W_{n-1})|] \to 0.$$

As a result, we must have that $H(W_n^0) - H(W_n)$ converges
to 0 almost surely (a.s.). We now claim that for a channel $P$,
in order to have $H(P^0) = H(P)$ we must have $H(P) = 0$
(i.e., $P$ is the noiseless channel) or $H(P) = 1$ (i.e., $P$ is the
completely noisy channel). By this claim and the fact that $H_n$
converges a.s. to $H_\infty$, we conclude that $H_\infty$ takes its values in
the set $\{0, 1\}$. Also, as $\mathbb{E}[H_n] = \mathbb{E}[H_\infty] = H(W)$, we obtain

$$H_\infty = \begin{cases} 0 & \text{w.p. } 1 - H(W), \\ 1 & \text{w.p. } H(W). \end{cases} \tag{21}$$



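It remains to prove the claim mentioned above. We use the
so-called extremes of information combining inequalities [13].
Let $P$ be an arbitrary BMS channel. To simplify notation, let
$h = H(P)$ and also let $\epsilon \in [0, \frac{1}{2}]$ be such that $h_2(\epsilon) = H(P)$.
We have

$$h \leq \underbrace{H(\text{BSC}(\epsilon)^0)}_{h_2(2\epsilon(1-\epsilon))} \leq H(P^0) \leq \underbrace{H(\text{BEC}(h)^0)}_{1-(1-h)^2}, \tag{22}$$

$$\underbrace{H(\text{BEC}(h)^1)}_{h^2} \leq H(P^1) \leq \underbrace{H(\text{BSC}(\epsilon)^1)}_{2h - h_2(2\epsilon(1-\epsilon))} \leq h. \tag{23}$$

Now, to prove the claim, assume that $P$ is such that $H(P^0) =
H(P)$. Using (22) we obtain $H(\text{BSC}(\epsilon)^0) = H(P)$ or
equivalently $h_2(2\epsilon(1 - \epsilon)) = h_2(\epsilon)$. Since $h_2$ is strictly
increasing on $[0, \frac{1}{2}]$ and both $\epsilon$ and $2\epsilon(1-\epsilon)$ lie in this
interval, $\epsilon$ must be a solution of the equation
$\epsilon = 2\epsilon(1-\epsilon)$, which yields $\epsilon \in \{0, \frac{1}{2}\}$.
Also, as $H(P) = h_2(\epsilon)$, then $H(P)$ can either be 0 or 1 and
hence the claim is justified. Using the bounds (5)-(7) it is clear
that the processes $Z_n$ and $E_n$ converge a.s. to $H_\infty$ and $\frac{1}{2} H_\infty$,
respectively.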

E. Polar Codes

Given the rate $R < I(W)$, polar coding is based on choosing
a set of $2^n R$ rows of the matrix
$G_n = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}^{\otimes n}$ to form a
$2^n R \times 2^n$ matrix which is used as the generator matrix in the
encoding procedure. The way this set is chosen is dependent
on the channel $W$ and is briefly explained as follows: Choose
a subset of size $NR$ from the set of channels $\{W_N^{(i)}\}_{0 \leq i \leq N-1}$
that have the least possible error probability (given in (4)) and
choose the rows of $G_n$ with the same indices as these channels.
E.g., if the channel $W_N^{(i)}$ is chosen, then the $i$-th row of $G_n$ is
selected. In the following, given $N$, we call the set of indices
of $NR$ channels with the least error probability the set of
good indices and denote it by $\mathcal{I}_{N,R}$. In the sequel, we will
use the term "the set of good indices" and $\mathcal{I}_{N,R}$
interchangeably.

It is proved in [1] that the block error probability of such a
polar coding scheme under SC decoding, denoted by $P_e$, is
bounded from both sides by³

$$\max_{i \in \mathcal{I}_{N,R}} E(W_N^{(i)}) \leq P_e \leq \sum_{i \in \mathcal{I}_{N,R}} E(W_N^{(i)}). \tag{24}$$

We now briefly explain why such a code construction is
reliable for any rate $R < I(W)$, provided that the block-length
is large enough. Recall from Section I-D that the
process $E_n = E(W_n)$ converges a.s. to a r.v. $E_\infty$ such that
$\Pr(E_\infty = 0) = 1 - H(W) = I(W)$. Hence, it is clear from the
definition of the set of good indices, $\mathcal{I}_{N,R}$, that the left side of
(24) decays to 0 as $n$ grows large. However, the story is not
over yet as this is only a lower bound on $P_e$. Nonetheless, one
can also show that the right side of (24) decays to 0. This was
initially shown in [1] and later in [3] the authors showed that
all of the three terms in (24) behave like $2^{-N^{\frac{1}{2} + o(1)}}$.
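
For the BEC, where $E(W) = Z(W)/2$ and the sub-channel parameters follow the closed-form recursion of Example 1, the selection of good indices and both sides of (24) can be computed directly. The following is a minimal sketch under that assumption; the function names are hypothetical, and the enumeration costs $2^n$ operations, so it is only meant for small $n$.

```python
def bec_bhattacharyya(z, n):
    """Bhattacharyya parameters of the 2**n sub-channels of BEC(z), by index."""
    values = [z]
    for _ in range(n):
        # Child 0 (left) is the worse channel, child 1 (right) the better one.
        values = [v for q in values for v in (1.0 - (1.0 - q) ** 2, q * q)]
    return values


def design_bec_polar_code(z, n, rate):
    """Pick the N*R most reliable indices and evaluate both sides of (24)."""
    zs = bec_bhattacharyya(z, n)
    size = int(rate * len(zs))
    order = sorted(range(len(zs)), key=lambda i: zs[i])
    good = sorted(order[:size])            # the set of good indices
    errors = [zs[i] / 2.0 for i in good]   # E(W) = Z(W)/2 holds for a BEC
    lower = max(errors) if errors else 0.0
    upper = sum(errors)
    return good, lower, upper


if __name__ == "__main__":
    good, lower, upper = design_bec_polar_code(0.5, 10, 0.3)
    print(len(good), lower, upper)
```

The rows of $G_n$ with the returned indices would then form the generator matrix; building $G_n$ itself is left out of this sketch.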

II. Problem Formulation

As we have seen in the previous section, the processes $H_n$
and $Z_n$ polarize in the sense that they converge a.s. to
$\{0, 1\}$-valued r.v.s $H_\infty$ and $Z_\infty$. Here, we investigate the dynamics of
polarization. We start by noting that at each time $n$ there still
exists a (small and in $n$ vanishing) probability that the process
$Z_n$ (or $H_n$) takes a value far away from the endpoints of the
unit interval (i.e., 0 and 1). Our primary objective is to study
these small probabilities. More concretely, let $0 < a < b < 1$
be constants and consider the quantity $\Pr(Z_n \in [a, b])$. This
quantity represents the fraction of sub-channels that are still
un-polarized at time $n$. An important question is how fast
the quantity $\Pr(Z_n \in [a, b])$ decays to zero. This question is
intimately related to measuring the limiting properties of the
sequence $\{\frac{1}{n} \log \Pr(Z_n \in [a, b])\}_{n \in \mathbb{N}}$.

Example 2: Assume $W = \text{BEC}(z)$. In this case the process
$Z_n$ has a simple closed form recursion as $Z_0 = z$ and

$$Z_{n+1} = \begin{cases} Z_n^2 & \text{w.p. } \frac{1}{2}, \\ 1 - (1 - Z_n)^2 & \text{w.p. } \frac{1}{2}. \end{cases} \tag{25}$$

Hence, it is straightforward to compute the value
$\Pr(Z_n \in [a, b])$ numerically. Let $a = 1 - b = 0.1$. Figure 2 shows the
value $\frac{1}{n} \log(\Pr(Z_n \in [a, b]))$ in terms of $n$ for $z = 0.5, 0.6, 0.7$.
This figure suggests that the sequence $\{\frac{1}{n} \log \Pr(Z_n \in [a, b])\}$
converges to a limiting value that is somewhere between -0.27
and -0.28. Note that for different values of $z$, the limiting
values are very close to each other. ◇
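
The computation described in Example 2 can be sketched in a few lines of Python. The code below enumerates all $2^n$ equally likely values of $Z_n$ given by recursion (25) and reports $\frac{1}{n} \log_2 \Pr(Z_n \in [a, b])$. The function names are illustrative assumptions, and the brute-force enumeration is only practical for moderate $n$ (roughly $n \leq 20$).

```python
import math


def unpolarized_fraction(z, n, a=0.1, b=0.9):
    """Pr(Z_n in [a, b]) for the BEC(z), by enumerating all 2**n paths of (25)."""
    values = [z]
    for _ in range(n):
        values = [v for q in values for v in (q * q, 1.0 - (1.0 - q) ** 2)]
    return sum(1 for v in values if a <= v <= b) / len(values)


def decay_exponent(z, n, a=0.1, b=0.9):
    """(1/n) * log2 Pr(Z_n in [a, b]); raises ValueError if the fraction is 0."""
    return math.log2(unpolarized_fraction(z, n, a, b)) / n


if __name__ == "__main__":
    for z in (0.5, 0.6, 0.7):
        print(z, [round(decay_exponent(z, n), 4) for n in (8, 12, 16)])
```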

For other BMS channels, the process $Z_n$ does not have a
simple closed form recursion as for the BEC, and hence we
need to use approximation methods (for more details see [15],
[16]). Using such methods, we have plotted in Figure 3 the
value of $\frac{1}{n} \log \Pr(Z_n \in [a, b])$ ($a = 1 - b = 0.1$) for the channel
families BSC($\epsilon$) and BAWGNC($\sigma$) with different parameter
values.

³Note here that by (4) the error probability of a BMS channel is less than
its Bhattacharyya value. Hence, the right side of (24) is a better upper bound
for the block error probability than the sum of Bhattacharyya values.

Fig. 2. The value of $\frac{1}{n} \log(\Pr(Z_n \in [a, b]))$ versus $n$ for $a = 1 - b = 0.1$
when $W$ is a BEC with erasure probability $z = 0.5$ (top curve), $z = 0.6$
(middle curve) and $z = 0.7$ (bottom curve).

Fig. 3. Left figure: The value of $\frac{1}{n} \log(\Pr(Z_n \in [a, b]))$ versus $n$
for $a = 1 - b = 0.1$ and $W$ being a BSC with cross-over probability
$\epsilon = 0.11, 0.146, 0.189$. These BSC channels have capacity 0.5, 0.4 and 0.3,
respectively. Right figure: the value of $\frac{1}{n} \log(\Pr(Z_n \in [a, b]))$ versus $n$ for
$a = 1 - b = 0.1$ and $W$ being a BAWGN with noise standard deviation $\sigma = 0.978$ (top
curve), $\sigma = 1.149$ (middle curve), and $\sigma = 1.386$ (bottom curve). These
BAWGN channels have capacities 0.5, 0.4 and 0.3, respectively.


The above numerical evidence suggests that the quantity
$\Pr(Z_n \in [a, b])$ decays to zero exponentially fast in $n$.
Further, we observe that the limiting value of
$\frac{1}{n} \log \Pr(Z_n \in [a, b])$ depends on the starting channel $W$
(e.g., from the figures it is clear that the channels BEC, BSC and
BAWGN have different limiting values). Let us now be concrete and
rephrase the above speculations as follows.

Question 1: Does the quantity $\Pr(Z_n \in [a, b])$ decay
exponentially in $n$? If yes, what is the limiting value of
$\frac{1}{n} \log \Pr(Z_n \in [a, b])$ and how is this limit related to the
starting channel $W$ and the choice of $a$ and $b$?

From Figures 2 and 3, we observe that the value of
$\frac{1}{n} \log \Pr(Z_n \in [a, b])$ is the least when $W$ is a BEC and this
suggests that the channel BEC polarizes faster than the other
BMS channels. This is intuitively justified as follows: Fix a
value $z \in (0, 1)$ and assume that $W$ is a BMS channel with
Bhattacharyya parameter $Z(W) = z$. Now, consider the values
$Z(W^0)$ and $Z(W^1)$. Using relations (12) and (13), it is clear
that the values $Z(W^0)$ and $Z(W^1)$ are closest to the end
points of the unit interval if $W$ is a BEC. In other words, at
the channel splitting transform, the channel BEC($z$) polarizes
faster than the other BMS channels.

Question 2: For which set of channels does the quantity
$\Pr(Z_n \in [a, b])$ decay the fastest or the slowest?

Let us now be more ambitious and aim for our ultimate goal.

Question 3: Can we characterize the exact behavior of
$\Pr(Z_n \in [a, b])$ as a function of $n$, $a$, $b$ and $W$?

Finally, we ask how the answers to the above questions
will guide us through the understanding of the finite-length
scaling behavior of polar codes. An immediate relation stems
from the fact that the quantity $\Pr(Z_n \in [a, b])$ indicates the
portion of the sub-channels that have not polarized at time
$n$. In particular, all the channels in this set have a large
Bhattacharyya value and hence cannot be included in the set
of good indices. Therefore, the maximum reliable rate that we
can achieve is restricted by the portion of these yet un-polarized
channels. Consequently, the answers to the above questions
will be crucial in finding answers to the following question.

Question 4: Fix the channel $W$ and a target block error
probability $P_e$. To have a polar code with error probability less
than $P_e$, how does the required block-length $N$ scale with the
rate $R$?

Finding a suitable answer to the above questions is an
easier task when the channel $W$ is a BEC. This is due to
the simple closed form expression of the process $Z_n$ given
in (25). In the next section (Section III), we provide heuristic
methods that lead to suitable numerical answers to Questions 1
and 3 for the BEC. As we will see in the next section,
such heuristic derivations are in excellent agreement with
numerical experiments. Using such derivations, we also give
an answer to Question 4 for the BEC. The heuristic results of
Section III then provide us with a concrete path to analytically
tackle the above questions. In Section IV we provide analytical
answers to Questions 1-4 for the BEC as well as other BMS
channels. Proving the full picture of Section III is beyond what
we achieve in Section IV; nevertheless, we provide close and
useful bounds.



III. Heuristic Derivation for the BEC

A. Scaling Law Assumption

Throughout this section we assume that the channel $W$ is
the BEC($z$) where $z \in [0, 1]$. To avoid cumbersome notation,
let us define

$$p_n(z, a, b) = \Pr(Z_n \in [a, b]), \tag{26}$$

where $Z_n$ is the Bhattacharyya process of the BEC($z$). We
start by noticing that by (25) the function $p_n(z, a, b)$ satisfies
the following recursion

$$p_{n+1}(z, a, b) = \frac{p_n(z^2, a, b) + p_n(1 - (1 - z)^2, a, b)}{2}, \tag{27}$$

with

$$p_0(z, a, b) = \mathbb{1}_{\{z \in [a, b]\}}. \tag{28}$$



More generally, one can easily observe the following. Let
$g : [0, 1] \to \mathbb{R}$ be an arbitrary bounded function. Define the
functions $\{g_n\}_{n \in \mathbb{N}}$ as

$$g_n(z) = \mathbb{E}[g(Z_n)]. \tag{29}$$

Note here that in (29) the parameter $z$ is the starting point of
the process $Z_n$, i.e., $Z_0 = z$. The functions $\{g_n\}_{n \in \mathbb{N}}$ satisfy the
following recursion for $n \in \mathbb{N}$

$$g_{n+1}(z) = \frac{g_n(z^2) + g_n(1 - (1 - z)^2)}{2}. \tag{30}$$

This observation motivates us to define the polar operator, call
it $T$, as follows. Let $\mathcal{B}$ be the space of bounded measurable
functions over $[0, 1]$. The polar operator $T : \mathcal{B} \to \mathcal{B}$ maps a
function $g \in \mathcal{B}$ to another function in $\mathcal{B}$ in the following way

$$T(g) = \frac{g(z^2) + g(1 - (1 - z)^2)}{2}. \tag{31}$$

It is now clear that

$$\mathbb{E}[g(Z_n)] = \underbrace{T \circ T \circ \cdots \circ T}_{n \text{ times}}(g) = T^n(g). \tag{32}$$
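
The operator view of (31) and (32) maps naturally onto function composition. The sketch below represents $T$ as a higher-order Python function, so that $T^n(g)$ evaluated at $z$ equals $\mathbb{E}[g(Z_n)]$ with $Z_0 = z$. The names `polar_operator`, `iterate_operator` and `indicator` are illustrative assumptions; evaluating $T^n(g)$ at a single point costs $2^n$ evaluations of $g$.

```python
def polar_operator(g):
    """Return T(g), with T(g)(z) = (g(z^2) + g(1 - (1 - z)^2)) / 2 as in (31)."""
    return lambda z: 0.5 * (g(z * z) + g(1.0 - (1.0 - z) ** 2))


def iterate_operator(g, n):
    """Return the function T^n(g) of (32)."""
    for _ in range(n):
        g = polar_operator(g)
    return g


def indicator(a, b):
    """The step function of (28): 1 on [a, b] and 0 elsewhere."""
    return lambda z: 1.0 if a <= z <= b else 0.0


if __name__ == "__main__":
    p10 = iterate_operator(indicator(0.1, 0.9), 10)
    print(p10(0.5))  # p_10(0.5, 0.1, 0.9) in the notation of (26)
```

Applying `iterate_operator` to the identity function returns (up to rounding) the starting point $z$, which reflects the fact that $Z_n$ is a martingale for the BEC.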



In this new setting, our objective is to study the limiting
behavior of the functions $T^n(g)$ when $g$ is a simple function
as in (28). This task is intimately related to studying the largest
eigenvalues of the polar operator $T$ and their corresponding
eigenfunctions. In this regard, to keep things in a simple
and manageable setting, we first consider finite-dimensional
approximations of $T$. This is done by discretizing the unit
interval into very small sub-intervals with the same length and
by assuming that $T$ operates on all the points of these
sub-intervals in the same way. More concretely, consider a (large)
number $L \in \mathbb{N}$ and let the numbers $x_i$, $i \in \{0, 1, \cdots, L - 1\}$, be
defined as $x_i = \frac{i}{L}$. Hence, the unit interval $[0, 1]$ can be
thought of as the union of the small sub-intervals $[x_i, x_{i+1}]$.
Now, for simplicity assume that $g$ is a (piece-wise) continuous
function on $[0, 1]$. Intuitively, by assuming $L$ to be large, we
expect that the value of $g$ is the same throughout each of the
intervals $[x_i, x_{i+1}]$. Such an assumption seems also reasonable
for the function $T(g)$ given in (31). We can approximate the
function $g$ as an $L$-dimensional vector

$$g_L = [g(x_0), g(x_1), \cdots, g(x_{L-1})]. \tag{33}$$

In this way, we expect that the function $T(g)$ can be well
approximated by a matrix multiplication

$$T(g) \approx g_L T_L, \tag{34}$$

where the $L \times L$ matrix $T_L$ is defined as follows. Let $T_L(i, j)$
be the element of $T_L$ in the $i$-th row and the $j$-th column.
Define $T_L(1, 1) = T_L(L, L) = 1$ and for the other elements of
$T_L$ we let

$$T_L(i, j) = \begin{cases} \frac{1}{2} & \text{if } j = \lceil L (\frac{i}{L})^2 \rceil, \\ \frac{1}{2} & \text{if } j = \lceil L (1 - (1 - \frac{i}{L})^2) \rceil, \\ 0 & \text{otherwise.} \end{cases} \tag{35}$$
TABLE I
Values of $\lambda_2(L)$ and $\lambda_3(L)$, which correspond to the
second and third largest eigenvalues of $T_L$ (in
absolute value), are computed numerically for
different values of $L$.

L: 1000 4000 8000
$\lambda_2(L)$: 0.8227
$\lambda_3(L)$: 0.6878 0.6958 0.7012 0.7046



All the rows of $T_L$ sum up to 1. Hence, an application of
the Perron-Frobenius theorem [14] shows that the eigenvalues
of $T_L$ are all inside the interval $[-1, +1]$. Also, it is easy to
see that $T_L$ has a trivial eigenvalue equal to $\lambda_0 = 1$ with two
corresponding eigenvectors

$$v_0 = (1, 0, \cdots, 0),$$
$$v_1 = (0, 0, \cdots, 1).$$

A little thought shows that $v_0$ and $v_1$ correspond to the
two extremal states of the polarization (i.e., the perfect channel
and the useless channel). This can be justified by the fact that
if we start from any initial vector $e_p$ that has value one at
position $p$ and value zero elsewhere, then

$$e_p T_L^n \xrightarrow{n \to \infty} c_0 v_0 + c_1 v_1,$$

where $c_0$ and $c_1$ are positive constants. This is just a rough
observation of the polarization phenomenon. In fact, by
polarization one can easily guess the following. Assuming $p = zL$,
we have

$$c_0 \approx 1 - z, \qquad c_1 \approx z.$$



However, we are interested in finding out how fast such a
convergence is taking place. For this purpose, we look at the
second and third largest eigenvalues (in absolute value) of $T_L$
as $L$ grows large. We denote the second largest eigenvalue of
$T_L$ by $\lambda_2(L)$ and the third largest by $\lambda_3(L)$. Table I
contains the values of these eigenvalues computed numerically
for several (large) values of $L$. It can thus be conjectured that

$$\lim_{L \to \infty} \lambda_2(L) \approx 0.826, \tag{36}$$
$$\lim_{L \to \infty} \lambda_3(L) \approx 0.705. \tag{37}$$
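
As a rough numerical illustration of the discretized operator, the sketch below builds the transitions of a chain in the spirit of (35) and propagates a unit vector $e_p$ through it. Instead of a full eigen-decomposition it estimates the decay rate from the ratio of the not-yet-absorbed mass in consecutive steps, which for large $n$ should approach the second largest eigenvalue. The 1-based state indexing, the integer form of the ceilings, and all function names are assumptions of this sketch.

```python
def build_transitions(L):
    """Targets (down, up) for states 1..L of a chain in the spirit of (35)."""
    targets = {1: (1, 1), L: (L, L)}      # the two absorbing end states
    for i in range(2, L):
        down = (i * i + L - 1) // L       # ceil(L * (i/L)^2)
        up = 2 * i - (i * i) // L         # ceil(L * (1 - (1 - i/L)^2))
        targets[i] = (down, up)
    return targets


def interior_decay_rate(L=1000, steps=60):
    """Propagate e_p with p = L/2 and return (mass ratio, c_0, c_1)."""
    targets = build_transitions(L)
    dist = [0.0] * (L + 1)                # index 0 is unused
    dist[L // 2] = 1.0
    prev_mass, ratio = 1.0, None
    for _ in range(steps):
        nxt = [0.0] * (L + 1)
        for i in range(1, L + 1):
            if dist[i]:
                down, up = targets[i]
                nxt[down] += 0.5 * dist[i]
                nxt[up] += 0.5 * dist[i]
        dist = nxt
        mass = sum(dist[2:L])             # mass not yet absorbed at 1 or L
        ratio = mass / prev_mass
        prev_mass = mass
    return ratio, dist[1], dist[L]


if __name__ == "__main__":
    print(interior_decay_rate())
```

The returned ratio is only an estimate of $\lambda_2(L)$ for this particular discretization; a dense eigenvalue solver would be needed to obtain $\lambda_3(L)$ as well.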



This belief guides us to conclude that for $L$ growing large,
if we start from any vector $g$ which is not a multiple of the
eigenvectors of $T_L$, then

$$g T_L^n \approx c_0 v_0 + c_1 v_1 + c_2 \lambda_2^n v_2 + O(n \lambda_3^n). \tag{38}$$

The above approximate relation indicates that for large $L$, the
distance of $g T_L^n$ from the limiting value is roughly of order
$\lambda_2^n$.

Now, let us go back to the original polar operator $T$ defined
in (31). As we argued above, the operators $T_L$, for $L$ large,
are good finite-dimensional approximations of $T$. The
(experimental) relation (38) brings us to the following assumption
about $T$.

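Assumption 1 (Scaling Assumption): There exists $\mu \in
(0, \infty)$ such that, for any $z, a, b \in (0, 1)$ such that $a < b$, the
limit $\lim_{n \to \infty} 2^{\frac{n}{\mu}} p_n(z, a, b)$ exists in $(0, \infty)$. We denote this
limit by $p(z, a, b)$. In other words,

$$\lim_{n \to \infty} 2^{\frac{n}{\mu}} \Pr(Z_n \in [a, b]) = p(z, a, b). \tag{39}$$

We call the value $\mu$ the scaling exponent of polar codes for
the BEC.

Note here that by (36) we expect that

$$2^{-\frac{1}{\mu}} \approx \lim_{L \to \infty} \lambda_2(L) \approx 0.826 \;\Longrightarrow\; \frac{1}{\mu} \approx 0.275. \tag{40}$$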



Let us now describe a numerical method for computing $\mu$
and $p(z, a, b)$. In this regard, we follow the approach of [11].
First we note that by (27) and the scaling law assumption we
conclude that

$$\frac{p(z^2, a, b) + p(1 - (1 - z)^2, a, b)}{2} = 2^{-\frac{1}{\mu}} p(z, a, b). \tag{41}$$

Equation (41) can be solved numerically by recursion. First
of all, note that the equation is invariant under multiplicative
scaling of $p$. Also, from the equation one can naturally guess
that $p(z, a, b)$ can be factorized into

$$p(z, a, b) = c(a, b) p(z), \tag{42}$$

where $p(z)$ is a solution of (41) with $p(\frac{1}{2}) = 1$. We iteratively
compute $\mu$ and $p(z)$.

Initialize $p_0(z)$, say, with $p_0(z) = 4z(1 - z)$ and compute
recursively new estimates $p_{n+1}(z)$ by first computing

$$\hat{p}_{n+1}(z) = p_n(z^2) + p_n(1 - (1 - z)^2),$$

and then by normalizing $p_{n+1}(z) = \hat{p}_{n+1}(z) / \hat{p}_{n+1}(\frac{1}{2})$, so that
$p_{n+1}(\frac{1}{2}) = 1$. We have implemented the above functional
recursion numerically by discretizing the $z$ axis. Figure 4
shows the resulting numerical approximation of $p_\infty(z)$ as
obtained by iterating the above procedure until
$\|p_{n+1}(z) - p_n(z)\|_\infty$ falls below a small tolerance (for all
$z \in [0, 1]$) on a fine grid of equi-spaced values of $z$. From this
recursion we also get a numerical estimate of the scaling exponent $\mu$.
In particular, by (41) we expect $\hat{p}_n(\frac{1}{2}) \to 2^{1 - \frac{1}{\mu}}$ as $n \to \infty$. Using this
method, we obtain the estimate $\frac{1}{\mu} \approx 0.2757$.
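
The functional recursion above can be sketched as a power iteration on a uniform grid. The Python code below is a minimal illustration, not the implementation used for Figure 4: the grid size, the fixed number of iterations (used here instead of a tolerance test), the linear interpolation between grid points, and the function names are all assumptions of this sketch. The normalizing value at $z = \frac{1}{2}$ estimates $2^{1 - 1/\mu}$ because the unnormalized update omits the factor $\frac{1}{2}$ of (41).

```python
import math


def interp(values, z):
    """Linear interpolation of grid samples over [0, 1] at the point z."""
    m = len(values) - 1
    t = z * m
    k = min(int(t), m - 1)
    frac = t - k
    return values[k] * (1.0 - frac) + values[k + 1] * frac


def estimate_inverse_mu(points=2001, iterations=150):
    """Return (estimate of 1/mu, samples of p on the grid)."""
    grid = [k / (points - 1) for k in range(points)]
    p = [4.0 * z * (1.0 - z) for z in grid]   # p_0(z) = 4z(1 - z)
    mid = (points - 1) // 2                   # grid[mid] == 0.5 for odd `points`
    scale = 1.0
    for _ in range(iterations):
        p_hat = [interp(p, z * z) + interp(p, 1.0 - (1.0 - z) ** 2) for z in grid]
        scale = p_hat[mid]                    # approaches 2**(1 - 1/mu)
        p = [v / scale for v in p_hat]        # renormalize so that p(1/2) = 1
    return 1.0 - math.log2(scale), p


if __name__ == "__main__":
    inv_mu, p = estimate_inverse_mu()
    print(inv_mu)
```

A finer grid or a higher-order interpolation rule would be natural refinements, since the limiting function is steep near the endpoints of the unit interval.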

As mentioned above, the function $p(z, a, b)$ differs from
$p(z)$ by a multiplicative constant $c(a, b)$ that is to be found by
other means. In Figure 5 we plot the functions $2^{\frac{n}{\mu}} p_n(z, a, b)$
for $a = 1 - b = \frac{1}{10}$ and different values of $n$. We observe
that, as $n$ increases, these plots and the curve $c(a, b) p(z)$ with
$c(a, b) = 0.92$ match very well. Even for moderate values of
$n$ (such as $n = 10$) we observe that the curves have a fairly
good agreement.

Fig. 4. The function $p(z)$ for $z \in [0, 1]$.

Fig. 5. The functions $2^{\frac{n}{\mu}} p_n(z, a, b)$ for various values of $n$. Here we have
fixed $b = 1 - a = 0.9$ and $\frac{1}{\mu} = 0.2757$. In all of the four plots the dashed
curve corresponds to $c(a, b) p(z)$ with $c(a, b) = 0.92$. Here, the function
$p(z)$ corresponds to the numerical solution of (41).

Let us now see what the scaling law assumption implies
about the finite-length behavior of polar codes. For simplicity,
we assume that communication takes place on the BEC($\frac{1}{2}$).
We are given a target error probability $P_e$ and want to achieve
a rate at least $R$. What block-length $N$ should we choose?

Consider the process $Z_n$ with $z = \frac{1}{2}$. It is easy to see that
the set of possible values that $Z_n$ takes in $[0, 1]$ is symmetric
around $z = \frac{1}{2}$. Now, according to the scaling law for $x \in [0, \frac{1}{2}]$,
there is a constant $p(\frac{1}{2}, x, \frac{1}{2}) = c(x)$ such that

$$\Pr(Z_n \in [x, \tfrac{1}{2}]) \approx c(x)\, 2^{-\frac{n}{\mu}}. \tag{43}$$

As a result, noticing the fact that $Z_n$ is symmetric around the
point $z = \frac{1}{2}$ we get

$$\Pr(Z_n \in [0, x]) \leq \frac{1}{2} - c(x)\, 2^{-\frac{n}{\mu}}. \tag{44}$$

From the construction procedure of polar codes (and especially
relation (24)), we know the following. Let $z(1) \leq z(2) \leq \cdots \leq
z(N)$ be a re-ordering of the $N$ possible outputs of $Z_n$ in
ascending order. Then, the error probability of a polar code
with rate $R$ is bounded from below by

$$P_e \geq 1 - \sqrt{1 - z(N \cdot R)^2} \geq \frac{z(N \cdot R)^2}{2}. \tag{45}$$

So in order to achieve error probability $P_e$, we should certainly
have $\frac{z(N \cdot R)^2}{2} \leq P_e$ or $z(N \cdot R) \leq \sqrt{2 P_e}$. Hence, by using (44)
we deduce that

$$R \leq \Pr(Z_n \in [0, \sqrt{2 P_e}]) \leq \frac{1}{2} - c(\sqrt{2 P_e})\, 2^{-\frac{n}{\mu}},$$

and finally, since $N = 2^n$,

$$N \geq \left( \frac{c(\sqrt{2 P_e})}{\frac{1}{2} - R} \right)^{\mu}. \tag{46}$$

Now, from the above calculations we know that $\frac{1}{\mu} \approx 0.2757$;
as a result for the channel $W = \text{BEC}(\frac{1}{2})$ we have

$$N \geq \Theta\left( \frac{1}{(I(W) - R)^{3.627}} \right). \tag{47}$$

In the next section, we provide methods that analytically
validate the above observations. We also extend some of these
observations to other BMS channels.

IV. Analytical Approach: from Bounds for the
BEC to Universal Bounds for BMS Channels

In this section we provide a rigorous basis for the
observations that were derived in the previous section. Proving the
full picture of Section III is beyond what we achieve here, but
we come up with close and useful bounds.

A. Characterization of $\mu$ for the BEC

We provide two approaches, which exploit different
techniques, to compute the scaling exponent $\mu$ for the BEC. The
first approach is based on a more careful look at equation
(31). We observe that simple bounds can be derived on
the largest nontrivial eigenvalue of the polar operator $T$ by
carefully analyzing the effect of $T$ on some suitably chosen
test functions. This approach provides us with a sequence
of bounds on $\mu$. We conjecture (and observe empirically)
that these bounds indeed converge to the value of $\mu$ that
is computed in Section III. The second approach considers
different compositions of the two operations $z^2$ and $2z - z^2$
and analyzes the asymptotic behavior of these compositions.
This approach provides us with a close lower bound on $\mu$.

1) First Approach: Consider the polar operator defined in
(31). The objective here is to compute the largest eigenvalues
of $T$. Specifically, we want to find the largest solutions $\lambda$ of

$$T(f) = \lambda f. \tag{48}$$

A check shows that both $f(z) = z$ and $f(z) = 1$ are
eigenfunctions associated to the eigenvalue $\lambda = 1$. Perhaps
more interestingly, let us look at the eigenvalues of $T$ inside
the interval $(0, 1)$. Intuitively, equation (41), together with the
scaling law, can be reformulated as follows. The operator $T$
has an eigenvalue $\lambda = 2^{-\frac{1}{\mu}}$ and a corresponding eigenfunction
$p(z)$ such that if we take any step function $f(z) = \mathbb{1}_{\{z \in [a, b]\}}$,
then

$$\lambda^{-n} T^n(f) \xrightarrow{n \to \infty} c(a, b) p(z). \tag{49}$$

In fact, if the scaling law is true, then we naturally expect
that (49) holds for a much larger class of functions than
the class of step functions. Heuristic arguments of the
previous section also suggest that (49) holds for all
(piece-wise) continuous functions $f(z)$ with $f(0) = f(1) = 0$.

Motivated by this picture, one approach to find bounds on
the eigenvalue consists of the following two steps: (1) choose
a suitable "test function" $f(z)$ for which we can provide good
bounds on the behavior of $T^n(f)$ and (2) turn these bounds
into bounds on the corresponding eigenvalue (or $\mu$). With this
in mind, for a generic test function $f(z) : [0, 1] \to [0, 1]$, let us
define the sequence of functions $\{f_n(z)\}_{n \in \mathbb{N}}$ as $f_n : [0, 1] \to
[0, 1]$ and for $z \in [0, 1]$,

$$f_n(z) = \mathbb{E}[f(Z_n)] = T^n(f). \tag{50}$$

Here, note that for $z \in [0, 1]$ the value of $f_n(z)$ is a
deterministic value that is dependent on the process $Z_n$ with
the starting value $Z_0 = z$. Let us now recall once more the
recursive relation of the functions $f_n$:

$$f_0(z) = f(z), \qquad f_n(z) = \frac{f_{n-1}(z^2) + f_{n-1}(1 - (1 - z)^2)}{2}. \tag{51}$$



In order to find lower and upper bounds on the speed of decay
of the sequence $f_n$, we define sequences of numbers $\{a_m\}_{m \in \mathbb{N}}$
and $\{b_m\}_{m \in \mathbb{N}}$ as

$$a_m = \inf_{z \in [0,1]} \frac{f_{m+1}(z)}{f_m(z)}, \qquad (52)$$

$$b_m = \sup_{z \in [0,1]} \frac{f_{m+1}(z)}{f_m(z)}. \qquad (53)$$

Lemma 3: Fix $m \in \mathbb{N}$. For all $n \ge m$ and $z \in [0, 1]$, we have

$$(a_m)^{n-m} f_m(z) \le f_n(z) \le (b_m)^{n-m} f_m(z). \qquad (54)$$

Furthermore, the sequence $a_m$ is an increasing sequence and
the sequence $b_m$ is a decreasing sequence.

Proof: Here, we only prove the left-hand side of (54) and
note that the right-hand side follows similarly. The proof goes
by induction on $n - m$. For $n - m = 0$ the result is trivial.
Assume that the relation (54) holds for $n - m = k$, i.e., for
$z \in [0, 1]$ we have

$$(a_m)^k f_m(z) \le f_{m+k}(z). \qquad (55)$$

We show that (54) is then true for $k + 1$ and $z \in [0, 1]$. We
have

$$\begin{aligned}
f_{m+k+1}(z) &\overset{(a)}{=} \frac{f_{m+k}(z^2) + f_{m+k}(1 - (1-z)^2)}{2} \\
&\overset{(b)}{\ge} \frac{(a_m)^k f_m(z^2) + (a_m)^k f_m(1 - (1-z)^2)}{2} \\
&= (a_m)^k f_{m+1}(z) \\
&= (a_m)^k \frac{f_{m+1}(z)}{f_m(z)} f_m(z) \\
&\ge (a_m)^k \Big[ \inf_{h \in [0,1]} \frac{f_{m+1}(h)}{f_m(h)} \Big] f_m(z) \\
&= (a_m)^{k+1} f_m(z).
\end{aligned}$$

Here, (a) follows from (51) and (b) follows from the induction
hypothesis (55), and hence the lemma is proved via
induction. ■

TABLE II
The values of $a_m$ corresponding to the test function
$f_0 = z(1 - z)$ are numerically computed for several choices of $m$.

  m          0        2        4        6        10
  a_m        0.75     0.7897   0.8074   0.8190   0.8239
  log a_m    -0.4150  -0.3406  -0.3086  -0.2880  -0.2794

TABLE III
The values of $b_m$ corresponding to $f_0 = (z(1 - z))^{2/3}$ are
numerically computed for several choices of $m$.

  m          0        -        -        6        -
  b_m        0.8312   0.8294   0.8279   0.8268   0.8264
  log b_m    -0.2663  -0.2699  -0.2725  -0.2744  -0.2751
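The recursion (51) and the sandwich (54) can be explored numerically. The following is a minimal sketch, assuming the test function $f_0(z) = z(1-z)$; the function names (`f`, `ratio_extremes`, `lemma3_holds`) and the grid size are illustrative choices, not part of the paper. A grid search only approximates the infimum in (52) from above, so it is a sanity check rather than a replacement for the exact root-finding method.

```python
def f0(z):
    """Test function f_0(z) = z(1 - z)."""
    return z * (1.0 - z)


def f(n, z):
    """f_n(z) via the recursion (51); direct recursion costs 2^n evaluations."""
    if n == 0:
        return f0(z)
    return 0.5 * (f(n - 1, z * z) + f(n - 1, 1.0 - (1.0 - z) ** 2))


def ratio_extremes(m, points=999):
    """Grid approximations of a_m (inf) and b_m (sup) of f_{m+1}/f_m on (0, 1)."""
    ratios = [f(m + 1, i / (points + 1)) / f(m, i / (points + 1))
              for i in range(1, points + 1)]
    return min(ratios), max(ratios)


def lemma3_holds(n, z, a=0.75, b=1.0, tol=1e-12):
    """Check (54) for m = 0, using a_0 = 0.75 and b_0 = 1 for f_0 = z(1 - z)."""
    value = f(n, z)
    return a ** n * f0(z) - tol <= value <= b ** n * f0(z) + tol


a0, b0_on_grid = ratio_extremes(0)
print("a_0 on grid:", a0)
print("sup of f_1/f_0 on grid:", b0_on_grid)
print("(54) holds at sample points:",
      all(lemma3_holds(n, z) for n in range(1, 7) for z in (0.1, 0.3, 0.5, 0.7, 0.9)))
```

For $m = 0$ the ratio is $f_1(z)/f_0(z) = z^2 - z + 1$, so the infimum $3/4$ is attained at $z = 1/2$, in agreement with the first entry of Table II.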

Let us now begin searching for suitable test functions, i.e.,
candidates for $f(z)$ that provide us with good lower and upper
bounds $a_m$ and $b_m$. We expect that a polynomial test
function might be slightly preferable: if $f$ is a polynomial,
then $T^n(f)$ is also a polynomial, and computing $a_m$ and $b_m$
is equivalent to finding roots of polynomials, which is a
manageable task. Of course the simplest polynomial that takes
the value 0 at $z = 0, 1$ is $f_0(z) = z(1 - z)$. Hence, let us
take our test function as $f(z) = f_0(z) = z(1 - z)$ and consider
the corresponding sequence of functions $\{f_n(z)\}_{n \in \mathbb{N}}$,

$$f_n(z) = \mathbb{E}[Z_n(1 - Z_n)] = T^n(f_0). \qquad (56)$$

A moment of thought shows that with $f_0 = z(1 - z)$ the
function $2^n f_n$ is a polynomial of degree $2^{n+1}$ with integer
coefficients. Let us first focus on computing the value of $a_m$
for $m \in \mathbb{N}$. If the relation (49) holds true, then we expect that
the value of $a_m$ converges to $\lambda = 2^{-1/\mu}$ as $m$ grows large.

Remark 4: One can compute the value of $a_m$ by finding
the extreme points of the function $f_{m+1}/f_m$ (i.e., finding the
roots of the polynomial $g_m = f'_{m+1} f_m - f_{m+1} f'_m$) and
checking which one gives the global minimum. Assuming
$f_0 = z(1 - z)$, for small values, e.g., $m = 0, 1$, pen and paper
suffice. For higher values of $m$, we can automate the process:
all these polynomials have rational coefficients and therefore
it is possible to determine the number of real roots exactly
and to determine their value to any desired precision. This
task can be accomplished precisely by computing so-called
Sturm chains (see Sturm's Theorem [17]). Computing a Sturm
chain amounts to running Euclid's algorithm starting with
the polynomial in question and its derivative.
Hence, we can find the value of $a_m$ analytically to any desired
precision. Table II contains the numerical value of $a_m$ up to
precision $10^{-4}$ for $m \le 10$. As the table shows, the values $a_m$
are increasing (see Lemma 3), and we conjecture that they
converge to $2^{-1/3.627} = 0.8260$, the corresponding value for the
BEC.
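The claim that $2^n f_n$ is a polynomial of degree $2^{n+1}$ with integer coefficients can be checked with exact rational arithmetic. The sketch below is an illustration only: polynomials are stored as coefficient lists (lowest degree first), and the helper names (`poly_add`, `poly_mul`, `poly_compose`, `next_f`) are hypothetical. It covers the exact representation step; a Sturm-chain root count would be built on top of such exact coefficients.

```python
from fractions import Fraction


def poly_add(p, q):
    """Add two coefficient lists (index = power of z)."""
    size = max(len(p), len(q))
    p = p + [Fraction(0)] * (size - len(p))
    q = q + [Fraction(0)] * (size - len(q))
    return [x + y for x, y in zip(p, q)]


def poly_mul(p, q):
    """Multiply two coefficient lists."""
    out = [Fraction(0)] * (len(p) + len(q) - 1)
    for i, x in enumerate(p):
        for j, y in enumerate(q):
            out[i + j] += x * y
    return out


def trim(p):
    """Drop trailing zero coefficients so that len(p) - 1 is the degree."""
    while len(p) > 1 and p[-1] == 0:
        p = p[:-1]
    return p


def poly_compose(p, q):
    """Return p(q(z)) using Horner's scheme."""
    result = [Fraction(0)]
    for coeff in reversed(p):
        result = poly_add(poly_mul(result, q), [coeff])
    return trim(result)


def next_f(p):
    """One step of (51): (p(z^2) + p(1 - (1 - z)^2)) / 2."""
    square = [Fraction(0), Fraction(0), Fraction(1)]   # z^2
    other = [Fraction(0), Fraction(2), Fraction(-1)]   # 2z - z^2 = 1 - (1 - z)^2
    total = poly_add(poly_compose(p, square), poly_compose(p, other))
    return trim([c / 2 for c in total])


polys = [[Fraction(0), Fraction(1), Fraction(-1)]]     # f_0 = z - z^2
for _ in range(3):
    polys.append(next_f(polys[-1]))

for n, p in enumerate(polys):
    scaled = [c * 2 ** n for c in p]
    print(n, "degree:", len(p) - 1, "integer coefficients:",
          all(c.denominator == 1 for c in scaled))
```

For instance, the first step gives $f_1(z) = z - 2z^2 + 2z^3 - z^4 = z(1-z)(z^2 - z + 1)$.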

Let us now focus on computing the value of $b_m$. On the
negative side, for the specific test function $f(z) = z(1 - z)$
we obtain $b_m = 1$ for $m \in \mathbb{N}$ and therefore the upper bounds
of (53) are of trivial use. In fact, it is not hard to show that
if we plug in any polynomial as the test function then we
get $b_m = 1$ for any $m$. On the positive side, we can consider
other test functions that result in non-trivial values for $b_m$.
The problem with non-polynomial functions is that methods
such as the Sturm-chain method no longer apply. Hence,
finding the precise value of $b_m$ up to a desired precision can
be a difficult task and we lose the analytical tractability of $b_m$.
As an example, choose

$$f_0(z) = z^{\alpha}(1 - z)^{\beta} \qquad (57)$$

for some choice of $\alpha, \beta \in (0, 1)$. Then, from (53) we have

$$b_0 = \sup_{z \in [0,1]} \frac{f_1(z)}{f_0(z)} = \sup_{z \in [0,1]} \frac{z^{\alpha}(1 + z)^{\beta} + (2 - z)^{\alpha}(1 - z)^{\beta}}{2}. \qquad (58)$$

By letting $\alpha = \beta = \frac{2}{3}$, we numerically get $b_0 = 0.8312$, which is
already a close bound for $\lambda$. This suggests that the test function
$f_0(z) = f(z) = (z(1 - z))^{2/3}$ is a suitable candidate for obtaining
good upper bounds $b_m$. For this specific test function, the value
of $b_m$ for various values of $m$ has been numerically computed
in Table III. As we observe from Table III, even for moderate
values of $m$ the (numerical) bound $b_m$ is very close to the true
"value" of $\lambda$.

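The supremum in (58) is easy to approximate on a grid. The sketch below assumes $\alpha = \beta = 2/3$ as in the text; the function names and the grid resolution are illustrative choices. Since a grid maximum can only underestimate the true supremum, the printed value should be read as an approximation of $b_0$.

```python
def ratio_58(z, alpha=2.0 / 3.0, beta=2.0 / 3.0):
    """f_1(z)/f_0(z) for f_0 = z^alpha (1 - z)^beta, as in (58)."""
    return 0.5 * (z ** alpha * (1.0 + z) ** beta
                  + (2.0 - z) ** alpha * (1.0 - z) ** beta)


def approx_b0(points=10000):
    """Grid approximation of the supremum in (58) over [0, 1]."""
    return max(ratio_58(i / points) for i in range(points + 1))


b0 = approx_b0()
print("approximate b_0:", round(b0, 4))
```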
Finally, let us relate the bounds $a_m$ and $b_m$ to bounds on
the functions $p_n(a,b,z)$. We have

Lemma 5: Consider the test function $f(z) = z(1 - z)$ and
the corresponding sequence of functions $f_n$ defined in (51).
Let $a, b \in (0, 1)$ be such that $\sqrt{a} \le 1 - \sqrt{1 - b}$. Then, there are
constants $c_1, c_2 > 0$ such that for any $z \in (0, 1)$

$$\frac{1}{n}\log f_n(z) - c_1\frac{\log n}{n} \le \frac{1}{n}\log \Pr(Z_n \in [a,b]) \le \frac{1}{n}\log f_n(z) + \frac{c_2}{n}. \qquad (59)$$

Also, for the test function $f(z) = (z(1 - z))^{2/3}$ and the
corresponding sequence $f_n$, defined in (51), we have for
$a, b \in (0,1)$

$$\frac{1}{n}\log \Pr(Z_n \in [a,b]) \le \frac{1}{n}\log f_n(z) + \frac{c_3}{n}, \qquad (60)$$

where $c_3$ is a positive constant.
We can now easily conclude that

Corollary 6: Fix $m \in \mathbb{N}$. For $a, b \in [0, 1]$ such that
$\sqrt{a} \le 1 - \sqrt{1 - b}$ and $n \ge m$ we have

$$\log a_m + O\Big(\frac{\log n}{n}\Big) \le \frac{1}{n}\log \Pr(Z_n \in [a,b]) \le \log b_m + O\Big(\frac{1}{n}\Big), \qquad (61)$$

where $a_m$ is defined in (52) with the test function $f(z) =
z(1 - z)$ (see Table II), and $b_m$ is defined in (53) with the
test function $f(z) = (z(1 - z))^{2/3}$ (see Table III).

Remark 7: We expect that the result of Lemma 5 holds
for any choice of $a$ and $b$ such that $a < b$. That is, the condition
$\sqrt{a} \le 1 - \sqrt{1 - b}$ is not a serious restriction and is only imposed
to simplify the proof.

2) Second Approach: Throughout this section we will
prove the following theorem.

Theorem 8: We have

$$\liminf_{n \to \infty} \frac{1}{n}\log\Big( \int_0^1 \Pr(Z_n \in [a,b])\,dz \Big) \ge \frac{1}{2\ln 2} - 1 \approx -0.2787. \qquad (62)$$

Let us now explain, at the intuitive level, the main consequence
of Theorem 8. By using the scaling law assumption, and
specifically (38) and (39), we have that $\int_0^1 \Pr(Z_n \in [a,b])\,dz \approx
\int_0^1 2^{-n/\mu} p(z,a,b)\,dz + o(2^{-n/\mu})$. This relation together with (62)
results in $-\frac{1}{\mu} \ge \frac{1}{2\ln 2} - 1 \approx -0.2787$, i.e., $\mu \ge 3.589$.
For the sake of brevity, we do not address here further (analytic)
conclusions of Theorem 8 and we refer the reader to [12].
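The constant in (62) and the bound on $\mu$ it implies are quick to evaluate. The short sketch below assumes that the consequence of (62) is read as a bound on $-1/\mu$ (with logarithms in base 2); the variable names are illustrative.

```python
import math

# Right-hand side of (62): 1 / (2 ln 2) - 1.
jensen_bound = 1.0 / (2.0 * math.log(2.0)) - 1.0

# If -1/mu >= jensen_bound (a negative number), then mu >= -1 / jensen_bound.
mu_lower = -1.0 / jensen_bound

print("1/(2 ln 2) - 1 =", round(jensen_bound, 4))
print("implied lower bound on mu =", round(mu_lower, 3))
```

The implied bound lies slightly below the BEC value $\mu = 3.627$ quoted in the text, which is consistent with the loss incurred by Jensen's inequality in the proof.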

To proceed with the proof, let us recall from Section I-D
the definition of $Z_n$ (for the BEC) in terms of the sequence
$\{B_n\}_{n \in \mathbb{N}}$. We start with $Z_0 = z$ and

$$Z_n = \begin{cases} Z_{n-1}^2, & \text{if } B_n = 1, \\ 2Z_{n-1} - Z_{n-1}^2, & \text{if } B_n = 0. \end{cases} \qquad (63)$$

Hence, by considering the two maps $t_0, t_1 : [0,1] \to [0,1]$
defined as

$$t_0(z) = 2z - z^2, \qquad t_1(z) = z^2, \qquad (64)$$

the value of $Z_n$ is obtained by applying $t_{B_n}$ to the value of
$Z_{n-1}$, i.e.,

$$Z_n = t_{B_n}(Z_{n-1}). \qquad (65)$$

The same rule applies for obtaining the value of $Z_{n-1}$ from
$Z_{n-2}$ and so on. Thinking this through recursively, the value of
$Z_n$ is obtained from the starting point of the process, $Z_0 = z$,
via the following (random) maps.

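Definition 9: For each $n \in \mathbb{N}$ and a realization $(b_1, \dots, b_n) =
\omega_n \in \Omega_n$ define the map $\phi_{\omega_n}$ by

$$\phi_{\omega_n} = t_{b_n} \circ t_{b_{n-1}} \circ \cdots \circ t_{b_1}. \qquad (66)$$

Also, let $\Phi_n$ be the set of all such $n$-step maps.
As a result, an equivalent description of the process $Z_n$ is as
follows. At time $n$ the value of $Z_n$ is obtained by picking
uniformly at random one of the functions $\phi_{\omega_n} \in \Phi_n$ and
assigning the value $\phi_{\omega_n}(z)$ to $Z_n$. Consequently we have

$$\Pr(Z_n \in [a,b]) = \sum_{\phi_{\omega_n} \in \Phi_n} \frac{1}{2^n}\, \mathbb{1}_{\{\phi_{\omega_n}(z) \in [a,b]\}}. \qquad (67)$$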
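For small $n$, the sum in (67) can be evaluated by enumerating all $2^n$ maps. The sketch below is an illustration for the BEC recursion (63); the function names are hypothetical. It also checks the martingale property $\mathbb{E}[Z_n] = z$, which holds because $(z^2 + 2z - z^2)/2 = z$.

```python
from itertools import product


def t(bit, z):
    """The maps in (64): t_1(z) = z^2 and t_0(z) = 2z - z^2."""
    return z * z if bit == 1 else 2.0 * z - z * z


def phi(bits, z):
    """phi_{omega_n}(z) = t_{b_n}( ... t_{b_1}(z) ... ), as in (66)."""
    for b in bits:          # b_1 is applied first
        z = t(b, z)
    return z


def prob_in_interval(n, z, a, b):
    """Exact evaluation of (67) by enumerating all 2^n realizations."""
    hits = sum(1 for bits in product((0, 1), repeat=n) if a <= phi(bits, z) <= b)
    return hits / 2 ** n


def mean_value(n, z):
    """E[Z_n] when Z_0 = z; equals z because Z_n is a martingale for the BEC."""
    return sum(phi(bits, z) for bits in product((0, 1), repeat=n)) / 2 ** n


print("Pr(Z_1 in [0.2, 0.3]) from z = 0.5:", prob_in_interval(1, 0.5, 0.2, 0.3))
print("E[Z_8] from z = 0.3:", mean_value(8, 0.3))
```

The enumeration costs $2^n$ map evaluations, so it is only practical for moderate $n$; it is meant to make the definition concrete, not to estimate the scaling exponent.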



Using (67), it is apparent that in order to analyze the
behavior of the quantity $\frac{1}{n}\log \Pr(Z_n \in [a,b])$ as $n$ grows
large, it is necessary to characterize the asymptotic behavior
of the random maps $\phi_{\omega_n}$. Continuing the theme of
Definition 9, we can assign to each realization of the infinite
sequence $\{B_k\}_{k \in \mathbb{N}}$, denoted by $\{b_k\}_{k \in \mathbb{N}}$, a sequence of maps
$\phi_{\omega_1}(z), \phi_{\omega_2}(z), \dots$, where $\omega_i = (b_1, \dots, b_i)$. We call the
sequence $\{\phi_{\omega_k}\}_{k \in \mathbb{N}}$ the corresponding sequence of maps for the
realization $\{b_k\}_{k \in \mathbb{N}}$. We also use the realization $\{b_k\}_{k \in \mathbb{N}}$ and
its corresponding $\{\phi_{\omega_k}\}_{k \in \mathbb{N}}$ interchangeably. Let us now focus
on the asymptotic characteristics of the functions $\phi_{\omega_n}$. Firstly,
since $\{\phi_{\omega_n}(z)\}_{\omega_n \in \Omega_n}$ has the same law as $Z_n$ starting at $z$, we
conclude that for $z \in [0, 1]$, with probability one, the quantity
$\lim_{k \to \infty} \phi_{\omega_k}(z)$ takes on a value in the set $\{0, 1\}$. In Figure 6
the functions $\phi_{\omega_n}$ are plotted for a random realization. As is
apparent from the figure, the functions $\phi_{\omega_n}$ seem to converge
point-wise to a jump function (i.e., a sharp rise from 0 to 1).
An intuitive justification of this fact is as follows. Consider a
random function $\phi_{\omega_n}$. Due to polarization, as $n$ grows large,
almost all the values that this function takes are very close to
0 or 1. This function is also increasing and continuous (more
precisely, it is a polynomial). A little thought reveals that the
only possible shape for $\phi_{\omega_n}$ is a very sharp rise from being
almost 0 to almost 1. The formal and complete statement is
given as follows.



Fig. 6. The functions $\phi_{\omega_n}$ associated to a random realization are plotted.
As we see, as $n$ grows large, the functions $\phi_{\omega_n}$ converge point-wise to a step
function.



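Lemma 10 (Almost every realization has a threshold point):
For almost every realization $\omega = \{b_k\}_{k \in \mathbb{N}} \in \Omega$ there exists
a point $z^*_\omega \in [0, 1]$, such that

$$\lim_{n \to \infty} \phi_{\omega_n}(z) = \begin{cases} 0, & z \in [0, z^*_\omega), \\ 1, & z \in (z^*_\omega, 1]. \end{cases}$$

Furthermore, $z^*_\omega$ has uniform distribution on $[0,1]$. We call
the point $z^*_\omega$ the threshold point of the realization $\{b_k\}_{k \in \mathbb{N}}$
or the threshold point of its corresponding sequence of maps
$\{\phi_{\omega_k}\}_{k \in \mathbb{N}}$.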
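The threshold behavior can be observed on a single sampled realization. The sketch below is illustrative only: it draws one bit sequence with a fixed, arbitrarily chosen seed, composes the maps (64), and locates by bisection the point where the finite-$n$ map crosses the level $1/2$. Using that crossing point as a stand-in for the threshold point is an assumption of this sketch; the lemma itself concerns the limit $n \to \infty$.

```python
import random


def t(bit, z):
    """t_1(z) = z^2 and t_0(z) = 1 - (1 - z)^2 = 2z - z^2."""
    return z * z if bit == 1 else 1.0 - (1.0 - z) * (1.0 - z)


def phi(bits, z):
    """Compose the maps in the order b_1, ..., b_n, as in (66)."""
    for b in bits:
        z = t(b, z)
    return z


def half_crossing(bits, iterations=60):
    """Bisection for the point where the increasing map phi crosses 1/2."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if phi(bits, mid) < 0.5:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


rng = random.Random(2024)                      # arbitrary seed for reproducibility
bits = [rng.randint(0, 1) for _ in range(30)]  # one realization (b_1, ..., b_30)
z_half = half_crossing(bits)
grid_values = [phi(bits, i / 200) for i in range(201)]

print("approximate crossing point:", z_half)
print("phi just below / above:", phi(bits, max(z_half - 0.1, 0.0)),
      phi(bits, min(z_half + 0.1, 1.0)))
```

Each map is increasing with $\phi_{\omega_n}(0) = 0$ and $\phi_{\omega_n}(1) = 1$, so the crossing point is well defined for every realization.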

Looking more closely at (67), by the above lemma we
conclude that as $n$ grows large, the maps $\phi_{\omega_n}$ that activate
the indicator function $\mathbb{1}_{\{\cdot\}}$ must have their threshold point
sufficiently close to $z$. Let us now give an intuitive discussion
about the idea behind the proof of Theorem 8. By using (67)
we can write

$$\Pr(Z_n \in [a,b]) = \sum_{\phi_{\omega_n} \in \Phi_n} \frac{1}{2^n}\, \mathbb{1}_{\{\phi_{\omega_n}(z) \in [a,b]\}} = \sum_{\phi_{\omega_n} \in \Phi_n} \frac{1}{2^n}\, \mathbb{1}_{\{z \in [\phi_{\omega_n}^{-1}(a),\, \phi_{\omega_n}^{-1}(b)]\}}. \qquad (68)$$

Hence by Lemma 10, for a large choice of $n$ the intervals
$[\phi_{\omega_n}^{-1}(a), \phi_{\omega_n}^{-1}(b)]$ have a very short length and are distributed
almost uniformly along $[0,1]$. Now, if we assume that the
length of the intervals $[\phi_{\omega_n}^{-1}(a), \phi_{\omega_n}^{-1}(b)]$ is very close to their
average, then we can replace the average in (68) by the average
length of $[\phi_{\omega_n}^{-1}(a), \phi_{\omega_n}^{-1}(b)]$. That is,

$$\Pr(Z_n \in [a,b]) \approx \mathbb{E}[\phi_{\omega_n}^{-1}(b) - \phi_{\omega_n}^{-1}(a)].$$

So intuitively, all that remains is to compute the average length
of the random intervals $[\phi_{\omega_n}^{-1}(a), \phi_{\omega_n}^{-1}(b)]$.
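The average length of these random intervals can be computed exhaustively for small $n$ by inverting the maps (64): the inverse of $t_1(z) = z^2$ is $\sqrt{y}$ and the inverse of $t_0(z) = 1 - (1-z)^2$ is $1 - \sqrt{1-y}$. The sketch below (with hypothetical function names and an arbitrary choice of $a$, $b$ and $n$) reports both the normalized logarithm of the mean length and the normalized mean of the logarithm, which are ordered by Jensen's inequality.

```python
import math
from itertools import product


def t_inv(bit, y):
    """Inverse maps: t_1^{-1}(y) = sqrt(y), t_0^{-1}(y) = 1 - sqrt(1 - y)."""
    return math.sqrt(y) if bit == 1 else 1.0 - math.sqrt(1.0 - y)


def phi_inv(bits, y):
    """Inverse of t_{b_n} o ... o t_{b_1}: undo b_n first, b_1 last."""
    for b in reversed(bits):
        y = t_inv(b, y)
    return y


def interval_stats(n, a, b):
    """Return ((1/n) log2 E[length], (1/n) E[log2 length]) over all 2^n maps."""
    lengths = [phi_inv(bits, b) - phi_inv(bits, a)
               for bits in product((0, 1), repeat=n)]
    mean_length = sum(lengths) / len(lengths)
    mean_log = sum(math.log2(length) for length in lengths) / len(lengths)
    return math.log2(mean_length) / n, mean_log / n


log_of_mean, mean_of_log = interval_stats(12, 0.25, 0.75)
print("(1/n) log2 E[length]:", log_of_mean)
print("(1/n) E[log2 length]:", mean_of_log)
```

Because the logarithm is concave, the first printed value is never smaller than the second; the gap between them is the price of replacing the mean by the mean of logarithms.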

In fact we are not able to make all these heuristics precise
for the point-wise values $\frac{1}{n}\log \Pr(Z_n \in [a,b])$. Nonetheless,
the picture is precise for the average of $\Pr(Z_n \in [a,b])$
over $z \in [0, 1]$, i.e.,

$$\frac{1}{n}\log\Big( \int_0^1 \Pr(Z_n \in [a,b])\,dz \Big). \qquad (69)$$

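To see this, we proceed as follows. By (68) we have

$$\begin{aligned}
\int_0^1 \Pr(Z_n \in [a,b])\,dz &= \int_0^1 \Big\{ \sum_{\phi_{\omega_n} \in \Phi_n} \frac{1}{2^n}\, \mathbb{1}_{\{z \in [\phi_{\omega_n}^{-1}(a),\, \phi_{\omega_n}^{-1}(b)]\}} \Big\}\,dz \\
&= \frac{1}{2^n} \sum_{\phi_{\omega_n} \in \Phi_n} \Big\{ \int_0^1 \mathbb{1}_{\{z \in [\phi_{\omega_n}^{-1}(a),\, \phi_{\omega_n}^{-1}(b)]\}}\,dz \Big\} \\
&= \mathbb{E}[\phi_{\omega_n}^{-1}(b) - \phi_{\omega_n}^{-1}(a)],
\end{aligned}$$

and by applying $\frac{1}{n}\log(\cdot)$ to both sides we have

$$\frac{1}{n}\log\Big( \int_0^1 \Pr(Z_n \in [a,b])\,dz \Big) = \frac{1}{n}\log \mathbb{E}[\phi_{\omega_n}^{-1}(b) - \phi_{\omega_n}^{-1}(a)] \ge \frac{1}{n}\mathbb{E}[\log(\phi_{\omega_n}^{-1}(b) - \phi_{\omega_n}^{-1}(a))], \qquad (70)$$

where in the last step we have used Jensen's inequality. The
value of $\lim_{n \to \infty} \frac{1}{n}\mathbb{E}[\log(\phi_{\omega_n}^{-1}(b) - \phi_{\omega_n}^{-1}(a))]$ can be computed
precisely.

Lemma 11: We have

$$\lim_{n \to \infty} \frac{1}{n}\mathbb{E}[\log(\phi_{\omega_n}^{-1}(b) - \phi_{\omega_n}^{-1}(a))] = \frac{1}{2\ln 2} - 1 \approx -0.2787.$$

As a result, we have

$$\liminf_{n \to \infty} \frac{1}{n}\log\Big( \int_0^1 \Pr(Z_n \in [a,b])\,dz \Big) \ge \frac{1}{2\ln 2} - 1.$$

The result of Theorem 8 provides a lower bound that is very
close to the value we obtained in Section III but is not exactly
equal. This is because we have used Jensen's inequality in
(70).

B. Speed of Polarization for General BMS Channels

For a BMS channel $W$, there is no simple one-dimensional
recursion for the process $Z_n$ as for the BEC. However, by
using (12) and (13), we can provide bounds on how $Z_n$
evolves:

$$\begin{cases} Z_n = Z_{n-1}^2, & \text{if } B_n = 1, \\ Z_n \in \Big[ Z_{n-1}\sqrt{2 - Z_{n-1}^2},\; 2Z_{n-1} - Z_{n-1}^2 \Big], & \text{if } B_n = 0. \end{cases} \qquad (71)$$

As a warm-up, we notice that techniques similar to those used in
Section IV-A1 are applicable to provide general lower and upper
bounds. For instance, to find upper bounds we can proceed
as follows. For any non-negative function $g : [0, 1] \to \mathbb{R}$ such
that $g(0) = g(1) = 0$ let

$$L_g = \sup_{z \in (0,1),\; y \in [z\sqrt{2 - z^2},\, z(2-z)]} \frac{g(z^2) + g(y)}{2g(z)}.$$

Similar to the discussion in Section IV-A1, we can show that
for the process $Z_n = Z(W_n)$ we have

$$\mathbb{E}[g(Z_n)] \le c\,L_g^n, \qquad (72)$$

where $c = \sup_{z \in [0,1]} g(z)$ is a constant. Hence, using the
Markov inequality we have for $a, b \in (0, 1)$,

$$\frac{1}{n}\log \Pr(Z_n \in [a,b]) \le \log L_g + O\Big(\frac{1}{n}\Big).$$

For example, assuming $g(z) = (z(1 - z))^{2/3}$ we numerically
obtain that $\log L_g = -0.169$. That is

$$\mathbb{E}[(Z_n(1 - Z_n))^{2/3}] \le 2^{-0.169n}, \qquad (73)$$

and for $a, b \in (0, 1)$ we have

$$\frac{1}{n}\log \Pr(Z_n \in [a,b]) \le -0.169 + O\Big(\frac{1}{n}\Big).$$

The relations of type (73) are upper bounds on the speed of
polarization that hold universally over all the BMS channels.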
Let us now compute universal lower bounds. In the rest
of this section, it is more convenient for us to consider
another stochastic process related to $W_n$, which is the process$^4$
$H_n = H(W_n)$. The main reason to consider $H_n$ rather than
$Z_n$ is that the process $H_n$ is a martingale and this martingale
property will help us to use the functions $\{f_n\}_{n \in \mathbb{N}}$ defined in
(51) (with the starting function $f(z) = z(1 - z)$) to provide
universal lower bounds on the quantity $\mathbb{E}[H_n(1 - H_n)]$. We
begin by introducing one further technical condition given as
follows.

Definition 12: We call an integer $m \in \mathbb{N}$ suitable if the
function $f_m(z)$, defined in (51) (with the starting function
$f(z) = z(1 - z)$), is concave on $[0, 1]$.

Remark 13: For small values of $m$, i.e., $m \le 2$, it is easy to
verify by hand that the function $f_m$ is concave. As discussed
previously, for larger values of $m$ we can use Sturm's theorem
[17] and a computer algebra system to verify this. Note that
the polynomials $2^m f_m$ have integer coefficients. Hence, all the
required computations can be done exactly. We have checked
up to $m = 8$ that $f_m$ is concave and we conjecture that in fact
this is true for all $m \in \mathbb{N}$.
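A quick numerical sanity check of concavity can be done with second differences. The sketch below is not the exact Sturm-chain verification described in the remark: it only samples the sign of a discrete second derivative on a grid, so a positive result suggests concavity without proving it. The function names, grid, and step size are illustrative choices.

```python
def f(n, z):
    """f_n(z) from (51) with starting function f_0(z) = z(1 - z)."""
    if n == 0:
        return z * (1.0 - z)
    return 0.5 * (f(n - 1, z * z) + f(n - 1, 1.0 - (1.0 - z) ** 2))


def second_difference(n, z, h=1e-3):
    """Discrete second derivative times h^2: f(z - h) - 2 f(z) + f(z + h)."""
    return f(n, z - h) - 2.0 * f(n, z) + f(n, z + h)


def looks_concave(n, points=199, h=1e-3):
    """True if every sampled second difference on (0, 1) is non-positive."""
    return all(second_difference(n, i / (points + 1), h) <= 0.0
               for i in range(1, points + 1))


for m in range(4):
    print("m =", m, "sampled second differences non-positive:", looks_concave(m))
```

For $m = 1$ the check can be confirmed by hand: $f_1(z) = z - 2z^2 + 2z^3 - z^4$ has $f_1''(z) = -12(z - \tfrac{1}{2})^2 - 1 < 0$.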

We now show that for any BMS channel $W$, the value of $a_m$,
defined in (52), is a lower bound on the speed of decay of $H_n$
provided that $m$ is a suitable integer.

Lemma 14: Let $m \in \mathbb{N}$ be a suitable integer and $W$ a BMS
channel. We have for $n \ge m$

$$\mathbb{E}[H_n(1 - H_n)] \ge (a_m)^{n-m} f_m(H(W)), \qquad (74)$$

where $a_m$ is given in (52).

Proof: We use induction on $n - m$: for $n - m = 0$ there
is nothing to prove. Assume that the result of the lemma is
correct for $n - m = k$. Hence, for any BMS channel $W$ with
$H_n = H(W_n)$ we have

$$\mathbb{E}[H_{m+k}(1 - H_{m+k})] \ge (a_m)^k f_m(H(W)). \qquad (75)$$

We now prove the lemma for $n - m = k + 1$. For the BMS
channel $W$, let us recall from Section I-B that the transform
$W \to (W^0, W^1)$ yields two channels $W^0$ and $W^1$ such that
(11) holds. Define the process $\{(W^0)_n, n \in \mathbb{N}\}$ as the channel
process that starts with $W^0$ and evolves as in (16). We define
$\{(W^1)_n, n \in \mathbb{N}\}$ similarly. Let us also define the two processes
$H^0_n = H((W^0)_n)$ and $H^1_n = H((W^1)_n)$. We have,

$$\begin{aligned}
\mathbb{E}[H_{m+k+1}(1 - H_{m+k+1})] &\overset{(a)}{=} \frac{\mathbb{E}[H^0_{m+k}(1 - H^0_{m+k})] + \mathbb{E}[H^1_{m+k}(1 - H^1_{m+k})]}{2} \\
&\overset{(b)}{\ge} (a_m)^k\, \frac{f_m(H(W^0)) + f_m(H(W^1))}{2} \\
&\overset{(c)}{\ge} (a_m)^k\, \frac{f_m(1 - (1 - H(W))^2) + f_m(H(W)^2)}{2} \\
&\overset{(d)}{=} (a_m)^k f_{m+1}(H(W)) \\
&= (a_m)^k\, \frac{f_{m+1}(H(W))}{f_m(H(W))}\, f_m(H(W)) \\
&\overset{(e)}{\ge} (a_m)^k \Big[ \inf_{h \in [0,1]} \frac{f_{m+1}(h)}{f_m(h)} \Big] f_m(H(W)) \\
&= (a_m)^{k+1} f_m(H(W)).
\end{aligned}$$

In the above chain of inequalities, relation (a) follows from
the fact that $W_m$ has $2^m$ possible outputs among which
half of them are branched out from $W^0$ and the other half
are branched out from $W^1$. Relation (b) follows from the
induction hypothesis given in (75). Relation (c) follows from
(22), (23) and the fact that the function $f_m$ is concave. More
precisely, because $f_m$ is concave on $[0, 1]$, we have the
following inequality for any sequence of numbers $0 \le x' \le
x \le y \le y' \le 1$ that satisfy $\frac{x + y}{2} = \frac{x' + y'}{2}$:

$$\frac{f_m(x') + f_m(y')}{2} \le \frac{f_m(x) + f_m(y)}{2}. \qquad (76)$$

In particular, we set $x' = H(W)^2$, $x = H(W^1)$, $y = H(W^0)$,
$y' = 1 - (1 - H(W))^2$ and we know from (22) and (23) that
$0 \le x' \le x \le y \le y' \le 1$. Hence, by (76) we obtain (c). Relation
(d) follows from the recursive definition of $f_m$ given in (51).
Finally, relation (e) follows from the definition of $a_m$ given
in (52). ■

$^4$For the BEC the processes $H_n$ and $Z_n$ are identical.

Finally, in the following two parts, we rigorously relate the
results obtained in previous sections to the finite-length
performance of polar codes. In other words, answering Question 4
is the main focus for the remaining parts of this section.

C. Universal Bounds on the Scaling Behavior of Polar Codes

1) Universal Lower Bounds: Consider a BMS channel $W$
and let us assume that a polar code with block-error probability
at most a given value $P_e > 0$ is required. One way to
accomplish this is to ensure that the right side of (24) is
less than $P_e$. However, this is only a sufficient condition
that might not be necessary. Hence, we call the right side of
(24) the strong reliability condition. Numerical and analytical
investigations (see [11] and [18]) suggest that once the sum of
individual errors in the right side of (24) is less than 1, then it
provides a fairly good estimate of $P_e$. In fact, the smaller the
sum is, the closer it is to $P_e$. Hence, the sum of individual errors
can be considered as a fairly accurate proxy for $P_e$. Based on
this measure of the block-error probability, we provide bounds
on how the rate $R$ scales in terms of the block-length $N$.

Theorem 15: For any BMS channel $W$ with capacity
$I(W) \in (0,1)$, there exist constants $P_e, \alpha > 0$, that depend
only on $I(W)$, such that

$$\sum_{i \in \mathcal{I}_{N,R}} E(W_N^{(i)}) \le P_e, \qquad (77)$$

implies

$$R < I(W) - \frac{\alpha}{N^{1/\mu}}, \qquad (78)$$

where $\mu$ is a universal parameter lower bounded by 3.553.
Here, a few comments are in order:

(i) As we have seen above, we can obtain an increasing
sequence of lower bounds, call this sequence $\{\mu_m\}_{m \in \mathbb{N}}$, for
the universal parameter $\mu$. For each $m$, in order to show the
validity of the lower bound, we need to verify the concavity of
a certain polynomial (defined in (51)) in $[0, 1]$. We explained
in Remark 13 how we can accomplish this using the Sturm
chain method. The lower bound for $\mu$ stated in Theorem 15
is the one corresponding to $m = 8$, an arbitrary choice. If
we increase $m$, we get e.g., $\mu_{10} = 3.614$. We conjecture
that the sequence $\mu_m$ converges to $\mu = 3.627$, the parameter
for the BEC. If such a conjecture holds, then the BEC
polarizes the fastest among the BMS channels (see
Question 2).

(ii) Let $P_e, \alpha, \mu$ be as in Theorem 15. If we require the
block-error probability to be less than $P_e$ (in the sense that the
condition (77) is fulfilled), then the block-length $N$ should be
at least

$$N > \Big( \frac{\alpha}{I(W) - R} \Big)^{\mu}. \qquad (79)$$

(iii) From (1) we know that the value of $\mu$ for the random
linear ensemble is $\mu = 2$, which is the optimal value since the
variations of the channel itself require $\mu \ge 2$. Thus, given a
rate $R$, reliable transmission by polar codes requires a larger
block-length than the optimal value.
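To get a feel for what the exponent in (79) means in practice, one can compare how the lower bound on the block-length grows when the gap to capacity is halved. The sketch below is a worked illustration: the constant `alpha` is not specified numerically in the text, so the value used here is purely hypothetical, and only the ratio between two gaps (which does not depend on `alpha`) is meaningful.

```python
def min_block_length(alpha, gap, mu):
    """Right-hand side of (79): (alpha / gap)^mu, with gap = I(W) - R."""
    return (alpha / gap) ** mu


def growth_when_gap_halves(mu):
    """Factor by which the bound (79) grows when I(W) - R is halved."""
    return 2.0 ** mu


hypothetical_alpha = 1.0  # placeholder value; the paper only states alpha > 0
for mu in (2.0, 3.553, 3.627):
    ratio = (min_block_length(hypothetical_alpha, 0.01, mu)
             / min_block_length(hypothetical_alpha, 0.02, mu))
    print("mu =", mu, "growth factor:", round(ratio, 3))
```

With $\mu = 2$ halving the gap costs a factor of 4 in block-length, whereas with $\mu = 3.553$ the factor is roughly 11.7.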

Proof of Theorem 15: To fit the bounds of Section IV-A1
into the framework of Theorem 15, let us first introduce the
sequence $\{\mu_m\}_{m \in \mathbb{N}}$ as

$$\mu_m = -\frac{1}{\log a_m}, \qquad (80)$$

where $a_m$ is defined in (52) with starting function $f(z) = z(1 - z)$.
In the previous section, we have proved that for a suitable
$m$, the speed with which the quantity $\mathbb{E}[H_n(1 - H_n)]$ decays
is lower bounded by $a_m = 2^{-1/\mu_m}$, i.e., for $n \ge m$ we have
$\mathbb{E}[H_n(1 - H_n)] \ge 2^{-(n-m)/\mu_m} f_m(H(W))$. To relate the strong
reliability condition in (77) to the rate bound in (78), we need
the following lemma.

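Lemma 16: Consider a BMS channel $W$ and assume that
there exist positive real numbers $\gamma$, $\theta$ and $m \in \mathbb{N}$ such that
$\mathbb{E}[H_n(1 - H_n)] \ge \gamma 2^{-n\theta}$ for $n \ge m$. Let $\alpha, \beta \ge 0$ be such that
$2\alpha + \beta = \gamma$. Then we have for $n \ge m$

$$\Pr(H_n \le \alpha 2^{-n\theta}) \le I(W) - \beta 2^{-n\theta}. \qquad (81)$$

Proof: The proof is by contradiction. Let us assume the
contrary, i.e., we assume there exists $n \ge m$ such that

$$\Pr(H_n \le \alpha 2^{-n\theta}) > I(W) - \beta 2^{-n\theta}. \qquad (82)$$

In the following, we show that such an assumption leads
to a contradiction. We have

$$\begin{aligned}
\mathbb{E}[H_n(1 - H_n)] &= \mathbb{E}[H_n(1 - H_n) \mid H_n \le \alpha 2^{-n\theta}] \Pr(H_n \le \alpha 2^{-n\theta}) \\
&\quad + \mathbb{E}[H_n(1 - H_n) \mid H_n > \alpha 2^{-n\theta}] \Pr(H_n > \alpha 2^{-n\theta}). \qquad (83)
\end{aligned}$$

It is now easy to see that

$$\mathbb{E}[H_n(1 - H_n) \mid H_n \le \alpha 2^{-n\theta}] \le \alpha 2^{-n\theta},$$

and since $\mathbb{E}[H_n(1 - H_n)] \ge \gamma 2^{-n\theta}$, by using (83) we get

$$\mathbb{E}[H_n(1 - H_n) \mid H_n > \alpha 2^{-n\theta}] \Pr(H_n > \alpha 2^{-n\theta}) \ge 2^{-n\theta}(\gamma - \alpha). \qquad (84)$$

We can further write

$$\begin{aligned}
\mathbb{E}[1 - H_n] &= \mathbb{E}[1 - H_n \mid H_n \le \alpha 2^{-n\theta}] \Pr(H_n \le \alpha 2^{-n\theta}) \\
&\quad + \mathbb{E}[1 - H_n \mid H_n > \alpha 2^{-n\theta}] \Pr(H_n > \alpha 2^{-n\theta}), \qquad (85)
\end{aligned}$$

and noticing the fact that $1 - H_n \ge H_n(1 - H_n)$ we can plug (84) in
(85) to obtain

$$\mathbb{E}[1 - H_n] \ge \mathbb{E}[1 - H_n \mid H_n \le \alpha 2^{-n\theta}] \Pr(H_n \le \alpha 2^{-n\theta}) + 2^{-n\theta}(\gamma - \alpha). \qquad (86)$$

We now continue by using (82) in (86) to obtain

$$\begin{aligned}
\mathbb{E}[1 - H_n] &> (I(W) - \beta 2^{-n\theta})(1 - \alpha 2^{-n\theta}) + 2^{-n\theta}(\gamma - \alpha) \\
&\ge I(W) + 2^{-n\theta}(\gamma - \alpha(1 + I(W)) - \beta),
\end{aligned}$$

and since $2\alpha + \beta = \gamma$ and $I(W) \le 1$, we get $\mathbb{E}[1 - H_n] > I(W)$. This is a
contradiction since $H_n$ is a martingale and $\mathbb{E}[1 - H_n] = I(W)$. ■

Let us now use the result of Lemma 16 to conclude the proof
of Theorem 15. By Lemma 14, we have for $n \ge m$

$$\mathbb{E}[H_n(1 - H_n)] \ge 2^{-(n-m)/\mu_m} f_m(H(W)).$$

Thus, if we now let $\gamma = 2^{m/\mu_m} f_m(H(W))$ and $2\alpha = \beta = \frac{\gamma}{2}$,
then by using Lemma 16 (with $\theta = 1/\mu_m$) we obtain

$$\Pr\Big(H_n \le \frac{\gamma}{4} 2^{-n/\mu_m}\Big) \le I(W) - \frac{\gamma}{2} 2^{-n/\mu_m}. \qquad (87)$$

Assume that we desire to achieve a rate $R$ equal to

$$R = I(W) - \frac{\gamma}{4} 2^{-n/\mu_m}. \qquad (88)$$

Let $\mathcal{I}_{N,R}$ be the set of indices chosen for such a rate $R$, i.e.,
$\mathcal{I}_{N,R}$ includes the $2^n R$ indices of the sub-channels with the
least value of error probability. Define the set $A$ as

$$A = \Big\{ i \in \mathcal{I}_{N,R} : H(W_N^{(i)}) \ge \frac{\gamma}{4} 2^{-n/\mu_m} \Big\}.$$

In this regard, note that (87) and (88) imply that $|A| \ge \frac{\gamma}{4} 2^{n(1 - 1/\mu_m)}$.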



As a result, by using (5) and (6) we obtain 






-^2 2«(l-2 ^) 

16 8n -!- 

Mm 



where the last step follows from the fact that for x e [0, -^], 



we have /i2^(x) > 



Thus, having a block-length N 



*log("X)- 

2", in order to have error probability (measured by (24)) less 

2 nil—I 1 ) 

than Ts-^-^^ P^^, the rate can be at most liW) - '^2^'i^ . 

Mm 

Finally, if we let m = 8 (by the discussion in Remark 13, 

3.553 



-log(as) 



(91) 



we know that to = 8 is suitable), then fig, = 
and choosing 

where R is given in (88), then it is easy to see from (90) that 
Pc > (since — < i) and furthermore, to have block-error 
probability less than Po the rate should be less than R given 
in (88). 

2) Universal Upper Bounds: In this part, we provide upper
bounds on the required block-length of Question 4. Again,
the key ingredient here is the upper bounds on the speed of
polarization, e.g., the bounds derived in Table III for the BEC
and the universal bound (73).

Theorem 17: Let $Z_n = Z(W_n)$ be the Bhattacharyya process
associated to a BMS channel $W$. Assume that for $n \in \mathbb{N}$
we have

$$\mathbb{E}[(Z_n(1 - Z_n))^{\alpha}] \le \beta 2^{-\rho n}, \qquad (92)$$

where $\alpha, \beta, \rho$ are positive constants and $\alpha < 1$. Then, the block-
length $N$ required to achieve an error probability $P_e > 0$ at a
given rate $R < I(W)$ is bounded above by

$$\log N \le \Big(1 + \frac{1}{\rho}\Big)\log\frac{1}{d} + c_4\Big(\log\Big(\log\frac{3}{d}\Big)\Big)^2 + c_5 \log\Big(\log\frac{2}{P_e}\Big)\log\Big(\log\frac{3}{d}\Big), \qquad (93)$$

where $d = I(W) - R$ and $c_4, c_5$ are positive constants that
depend on $\alpha, \beta, \rho$.

Before proceeding with the proof of Theorem 17, let us note
a few comments:

(i) In the previous sections we have computed several
candidates for the value $\rho$ required in Theorem 17. As an
example, using the universal candidate for $\rho$ obtained in (73)
(i.e., $\rho = 0.169$), we obtain the following corollary.

Corollary 18: For any BMS channel $W$, the block-length
$N$ required to achieve a rate $R < I(W)$ scales at most as

$$N \le \Theta\Big( \frac{1}{(I(W) - R)^{7}} \Big). \qquad (94)$$

One important consequence of this corollary is that polar codes
require a block-length that scales polynomially in terms of the
gap to capacity$^5$.

(ii) As we will see in the proof of Theorem 17, the result
of this theorem is also valid if we replace $P_e$ with the sum of
Bhattacharyya values of the channels that correspond to the
good indices (this sum is indeed an upper bound for $P_e$).

$^5$The fact that polar codes need a polynomial block-length in terms of the
gap to capacity is also proven in the recent independently-derived result of
[19].

Proof of Theorem 17: Throughout the proof we will be using
two key lemmas (Lemma 19 and Lemma 20) that are stated
in the appendices. Let

$$d = I(W) - R. \qquad (95)$$

We define $n_0 \in \mathbb{N}$ to be

$$n_0 = \frac{1}{\rho}\log\frac{3(1 + c_1)(1 + 2c_2c_3)}{d}, \qquad (96)$$

where the constants $c_1$, $c_2$ and $c_3$ are given in Lemmas 19, 20
and 21, respectively. As a result of Lemma 19 and (96), we
have for $n \ge n_0$



We also define the set A as follows. Let Nq = 2"" 



and 



A={ie{0,-,NQ 



(97) 



(98) 



In other words, $A$ is the set of indices at level $n_0$ of the
corresponding infinite binary tree of $W$ (see Section I-C)
whose Bhattacharyya parameter is not so large. Also, from
(97) the set $A$ contains more than a fraction $R$ of all the
sub-channels at level $n_0$. The idea is then to go further down
through the infinite binary tree to a level $n_0 + n_1$ (the value
of $n_1$ will be specified shortly). We then observe that the
sub-channels at level $n_0 + n_1$ that are branched out from the set
$A$ are polarized to a great extent in the sense that the sum of
their Bhattacharyya parameters is below $P_e$ (see Figure 7 for
a schematic illustration of the idea).







indices at level no + ni with the following properties: (i) sum 
of the Bhattacharyya parameters of the sub-channels in this 
set is less than P^ and (ii) the cardinality of this set is at least 
jj2"o+ni jjj what follows, we will first use the hypothesis of 
Lemma 20 to give a candidate for ni and then we make it clear 
that such a candidate is suitable for our needs. Let {Bm}mm 
be a sequence of iid Bemoulli(i) random variables. We let ni 
be the smallest integer such that the following holds 



Pr(2" 



2na+ni ' 3 



(99) 



(96) It is easy to see that (99) is equivalent to 



"11 d 

Pr(Xi3, > log(log--) +log(no + ni)) > 1 - -. (100) 

Also, as the random variables Bi are Bernoulli(|) and iid, the 
relation (100) is equivalent to 



log(log ■^)+log(no+"l) 



(7) 



2"i 



d 

< -. 

3 



(101) 



A sufficient condition for (101) to hold is as follows: 

l+log(log ■J-)+log(no+"l) , 



2"1 3 

and after applying the function log(-) to both sides and some 
further simplifications we reach to 

1 3 

ni -(l + log(log-— ) +log(no + ni))logni >log-. (102) 

Pe d 

It can be shown through some simple steps that there are 
constants C6,C7 > s.t. if we choose 



ni 



3 3 2 3 

log - + C6(l0g(l0g -))^ + C7 log(log( — )) l0g(l0g -) 

d d Pr d 



(103) 

then the inequality (102) holds. Now, let N = 2"«+"i and 
consider the set Ai defined as 




Fig. 7. The infinite binary tree of channel W . The edges that are colored 
red at level no of this tree correspond to the sub-channels at level no whose 
Bhattacharyya parameter is less that i (i.e., the set A). The idea is then to 
focus on these "red" indices. We consider the sub-channels that are branched 
out from these red indices at a level no + n\ (as shown in the figure). By 
a careful choice of ni, we observe that these specific sub-channels at level 
no + ni are greatly polarized in the sense that sum of their Bhattacharyya 
parameters is less than Pa- We also show that the fraction of these sub- 
channels is larger than R. 



A = {«6{0,-,iV-l}:Z(I^W)<_|} (104) 



P. 



We now show that 



N 



>R. 



(105) 



This relation together with (104) shows that block error 
probability of the polar code of block-length N and rate R 
is at most P^. In order to show (105), we consider the sub- 
channels Ai that are branched out from the ones in the set A. 
Let i e A and consider the sub-channel W^ . By using the 
relations (71), Lemma 20 and (99) we conclude the following. 
At level no + ni, the number of sub-channels that are branched 
out from W^' and have Bhattacharyya value less than ^ is 
at least 



We proceed by finding a suitable candidate for ni. Our 2"i(l 

objective is to choose ni large enough s.t. there is a set of 



-C2Z(I^«)(1 



Ids 



ZiW^N!) 



))(1-:t)- 



Hence, by using (98) the total number of sub-channels at level 
no + ni that are branched out from a sub-channel in A and 
have Bhattacharyya value less that -^ is 



N 






z{w^;!) 



(i06) 



Now, by using Lemma 21 we have 



C2E^(^l?o)(l+log 



1 



iaA 



Z{W^n!) 



) 



ieA 

<2c2C3E[(Z„„(l-Z„J)"] 

< 2c2C32-"«'' 
(96) d 

< -. 
3 

Therefore, the expression (106) is lower-bounded by 

(J (J 

Hence, the relation (105) is proved and a block-length of size
$N$ is sufficient to achieve a rate $R$ and error at most $P_e$. It is
now easy to see that $\log N = n_0 + n_1$ has the form of (93).
Appendix A
Proofs

1) Proof of Lemma 5: The proof of the right side of (59) and
also of (60) is an easy application of the Markov inequality. To
prove the left side of (59), we define sequences $\{x_n\}_{n \ge 1}$ and
$\{y_n\}_{n \ge 1}$ as

$$x_n = 2^{-n}, \qquad (107)$$

$$y_n = 1 - 2^{-n}. \qquad (108)$$

We start by noting that

$$\mathbb{E}(Z_n(1 - Z_n)) \le \sum_{i=1}^{n} 2^{-i} \Pr(Z_n \in [x_{i+1}, x_i]) + \sum_{i=1}^{n} 2^{-i} \Pr(Z_n \in [y_i, y_{i+1}]) + 2^{-n}.$$

As a result, there exists an index $j \in \{1, \dots, n\}$ such that at
least one of the following cases occurs:

$$\mathbb{E}[Z_n(1 - Z_n)] \le 2n\big[2^{-j} \Pr(Z_n \in [x_{j+1}, x_j]) + 2^{-n}\big], \qquad (109)$$

or

$$\mathbb{E}[Z_n(1 - Z_n)] \le 2n\big[2^{-j} \Pr(Z_n \in [y_j, y_{j+1}]) + 2^{-n}\big]. \qquad (110)$$

We show that in each of these cases the statement of the lemma 
holds. Firstly, note that because of the symmetry of Z„ we can 
write 

Hence, without loss of generality we can assume that (109) 
holds. We first prove the lemma for a = 1 - 6 = j. We then use 
this result to prove the lemma in its fullest extent. We claim 
that for any 1 < j < n we have, 

IS v^ 
2-^Pr(Z„ 6 [x,,i,x,]) < 2(n+ l)Pr(Z„ e [-, -]) + -. 

(Ill) 
Assuming that the above claim holds true, by using (109) we 
obtain 

E(Z„(1 - Z„)) < 2n[Pr(Z„ e [1, ^]) + I^], 

and as a result, by taking — log(-) from both sides, the first 
part of the lemma is proved for a = 1 - 6 = j. 

We now turn to the proof of relation (111) for 1 < j < n. For 
j = 1, the result of the claim is trivial. Hence, in the following 



we assume that 2 < j < n. We now prove that for any fixed j 
such that 2 < j < n, we have 

2-^Pr(Z„ 6 [xj,i,x,]) < 2(n + l)Pr(Z„ e [-, -]) + -, 

(112) 
and hence the relation (111) is also proved. We fix the index 
j and prove the above claim for any value of n e N. The proof 
consist of two steps. 

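Step 1: We first show that for all $m \in \mathbb{N}$,

$$\Pr(Z_m \in [x_{2j+2}, x_j]) \le m \Pr\Big(Z_m \in \Big[x_j, \frac{3}{4}\Big]\Big) + \frac{1}{2^m}. \qquad (113)$$

To prove (113), fix $m \in \mathbb{N}$ and define the sets $A$ and $B$ as

$$A = \{(b_1, \dots, b_m) \in \Omega_m : t_{b_m} \circ \cdots \circ t_{b_1}(z) \in [x_{2j+2}, x_j]\},$$

$$B = \Big\{(b_1, \dots, b_m) \in \Omega_m : t_{b_m} \circ \cdots \circ t_{b_1}(z) \in \Big[x_j, \frac{3}{4}\Big]\Big\}.$$

In other words, $A$ is the set of all the paths that start from
$z = Z_0$ and end up in $[x_{2j+2}, x_j]$ and $B$ is the set of paths
that start from $z$ and end up in $[x_j, \frac{3}{4}]$. We now partition the
set $A$ into the disjoint sets $A_k$, $k \in \{0, 1, \dots, m\}$, defined as

$$A_k = \{(b_1, \dots, b_m) \in A : b_k = 1;\; b_i = 0 \;\; \forall i > k\}. \qquad (114)$$

It is easy to see that $|A - \cup_k A_k| \le 1$. Our aim is now to show
that for $k \in \{0, 1, \dots, m\}$,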



\Ak\<\B\. 



(115) 



To do this, we show that there exists a one-to-one correspondence between $A_k$ and a subset of $B$. In other words, we claim that we can map each member of $A_k$ to a distinct member of $B$. Consider $(b_1,\dots,b_m) \in A_k$. We now construct a distinct member $(b'_1,\dots,b'_m) \in B$ corresponding to $(b_1,\dots,b_m)$. We first set $b'_i = b_i$ for $i < k$ and hence the uniqueness condition is fulfilled. Consider the number $x$ defined as

$$x = \begin{cases} z, & \text{if } k = 1, \\ t_{b_{k-1}} \circ \dots \circ t_{b_1}(z), & \text{if } k > 1. \end{cases} \tag{116}$$

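Note that since $(b_1,\dots,b_m) \in A_k$ we have

$$t_{b_m} \circ \dots \circ t_{b_k}(x) \in [x_{2j+1}, x_j]. \tag{117}$$

Now, note that as $(b_1,\dots,b_m) \in A_k$, we have $b_k = 1$ and $b_i = 0$ for $i > k$. Thus, in this setting (117) becomes

$$\underbrace{t_0 \circ \dots \circ t_0}_{m-k \text{ times}}(x^2) \in [x_{2j+1}, x_j]. \tag{118}$$

Hence,

$$x_{2j+1} \le 1-(1-x^2)^{2^{m-k}} \le x_j. \tag{119}$$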



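From the left side of (119) and using the fact that $1-(1-x^2)^{2^{m-k}} \le 2^{m-k}x^2$ we obtain

$$x_{2j+1} \le 2^{m-k}x^2 \ \Rightarrow\ 2^{-j+\frac{k-m-1}{2}} \le x. \tag{120}$$

From the right side of (119) we have

$$\ln(1-x_j) \le 2^{m-k}\ln(1-x^2),$$

and by using the inequality $-x - x^2 \le \ln(1-x) \le -x$ (valid for $0 \le x \le \tfrac12$) we obtain

$$x \le 2^{\frac{-j+k-m+1}{2}}. \tag{121}$$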



Let us recall that we let $b'_i = b_i$ for $i < k$. We now construct the remaining values $b'_k,\dots,b'_m$ by the following algorithm: consider the number $x$ given in (116). In the following, we will also construct a sequence $x = x_{k-1}, x_k, x_{k+1}, \dots, x_m$ such that for $i \ge k$ we have $x_i = t_{b'_i}(x_{i-1})$. Begin with the initial value $x_{k-1} = x$ and for $i \ge k$ recursively construct $b'_i$ and $x_i$ from $x_{i-1}$ by the following rule: if $t_0(x_{i-1}) \le \tfrac34$, then $b'_i = 0$ and $x_i = t_0(x_{i-1})$, otherwise $b'_i = 1$ and $x_i = t_1(x_{i-1})$. We now show that the value of $x_m$ is always in the interval $[x_j, \tfrac34]$. In this regard, an important observation is that for $i$ s.t. $k-1 \le i \le m$, once the value of $x_i$ lies in the interval $[x_j, \tfrac34]$ then for all $i \le t \le m$ we have $x_t \in [x_j, \tfrac34]$. Hence, we only need to show that by the above algorithm, there exists an index $i$, s.t. $k-1 \le i \le m$, for which the value of $x_i$ lies in the interval $[x_j, \tfrac34]$. On one hand, observe that due to (121) and the fact that $j \ge 2$, we have $x \le 2^{-\frac12} < \tfrac34$. Thus, the value of $x_i$ is definitely less than $\tfrac34$ for $i \ge k$. If the value of $x_{k-1}$ is also greater than $x_j$ then we have nothing to prove. Else, it might be the case that $x < x_j$. We now show that in this case the algorithm moves in a way that the value of $x_m$ falls eventually into the desired region $[x_j, \tfrac34]$. To show this, a moment of thought reveals that this is equivalent to showing that we always have

$$\underbrace{t_0 \circ \dots \circ t_0}_{m-k+1 \text{ times}}(x) = 1-(1-x)^{2^{m-k+1}} \ge x_j. \tag{122}$$

Note that the function $1-(1-x)^{2^{m-k+1}}$ is a strictly increasing function on the unit interval. Thus, in order to have (122) it is equivalent that

$$2^{m-k+1}\ln(1-x) \le \ln(1-x_j),$$

and after some further simplification using the inequality $-x - x^2 \le \ln(1-x) \le -x$, we deduce that a sufficient condition to have (122) is

$$x_j \le 2^{m-k}x \ \Rightarrow\ 2^{-j+k-m} \le x. \tag{123}$$

But this sufficient condition is certainly met by considering the inequality (120) and noting the fact that $-j + \frac{k-m-1}{2} \ge -j + k - m$. Hence, the claim in (115) is proved and as a result, the claim in (113) is true.
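The two analytic facts used in Step 1 can be sanity-checked numerically. The sketch below is only an illustration under the conventions of this appendix, namely $x_j = 2^{-j}$ and $t_0(x) = 1-(1-x)^2$; the function names are hypothetical and `s` stands for $m-k$.

```python
import math

def log_bounds_hold(x: float) -> bool:
    """Check -x - x^2 <= ln(1 - x) <= -x for a given x in (0, 1/2]."""
    lg = math.log1p(-x)
    return (-x - x * x) <= lg <= -x

def iterate_t0(x: float, times: int) -> float:
    """t_0 applied `times` times: 1 - (1 - x)^(2^times), computed stably."""
    return -math.expm1((2 ** times) * math.log1p(-x))

def sufficient_condition_implies_122(j: int, s: int, x: float) -> bool:
    """With s = m - k: if x_j <= 2^s * x (condition (123)), then t_0
    applied s + 1 times to x is at least x_j (relation (122))."""
    x_j = 2.0 ** (-j)
    if x_j > (2 ** s) * x:
        return True  # premise of (123) not met, nothing to check
    return iterate_t0(x, s + 1) >= x_j
```

For instance, `sufficient_condition_implies_122(3, 2, 2.0 ** -5)` checks the boundary case $x = 2^{-j-(m-k)}$ for $j = 3$, $m - k = 2$.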

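Step 2: Firstly note that in order for $Z_n$ to be in the interval $[x_{j+1}, x_j]$, the value of $Z_{n-j}$ should lie in the interval $[x_{2j+1}, x_j^{2^{-j}}]$. As a result, we can write

$$\begin{aligned} \Pr(Z_n \in [x_{j+1}, x_j]) &= \Pr\big(Z_n \in [x_{j+1}, x_j] \mid Z_{n-j} \in [x_{2j+1}, x_j]\big) \times \Pr\big(Z_{n-j} \in [x_{2j+1}, x_j]\big) \\ &\quad + \Pr\big(Z_n \in [x_{j+1}, x_j] \mid Z_{n-j} \in (x_j, x_j^{2^{-j}}]\big) \times \Pr\big(Z_{n-j} \in (x_j, x_j^{2^{-j}}]\big), \end{aligned} \tag{124}$$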

and by letting m = n-j in relation (113), we can easily obtain 

,2 



3 n"^ + 1 

¥v{Zn-j € [X2j + l,xj]) < nPx{Zn-j ^[Xj,-]) + ^5;;— • 



(125) 



(121) 



Thus, by combining (124) and (125), we obtain 
Pr(2'„ 6 [xj+i.Xj])
< nPr{Zn-j 6 [xj,-])+Pr{Z„.j e [x^^x] ' ])



(126) 



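Finally, in order to conclude the proof of (112), we prove the following relations:

$$2^{-j}\Pr\big(Z_{n-j} \in [x_j, \tfrac34]\big) \le \Pr\big(Z_n \in [\tfrac14, \tfrac34]\big) \tag{127}$$

and

$$2^{-j}\Pr\big(Z_{n-j} \in [x_j, x_j^{2^{-j}}]\big) \le \Pr\big(Z_n \in [\tfrac14, \tfrac34]\big). \tag{128}$$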

Firstly note that since {xj)^'^' > |, then it is enough to 
prove (128). To prove (128), we only need to show that for a value $x$ s.t. $x \in [x_j, (x_j)^{2^{-j}}]$, there exists a $j$-tuple $(b_1,\dots,b_j) \in \Omega_j$ such that $t_{b_j} \circ \dots \circ t_{b_1}(x) \in [\tfrac14, \tfrac34]$. We show this by constructing the binary values $b_1,\dots,b_j$ in terms of $x$. Consider the following algorithm: start with $y_0 = x$ and for $1 \le i \le j$, we recursively construct $b_i$ from $y_{i-1}$ by the following rule: If $t_0(y_{i-1}) \le \tfrac34$, then $b_i = 0$ and $y_i = t_0(y_{i-1})$. Otherwise, let $b_i = 1$ and $y_i = t_1(y_{i-1})$. To show that this algorithm succeeds in the sense that $y_j \in [\tfrac14, \tfrac34]$, we first observe that once the value of $y_i$ lies in the interval $[\tfrac14, \tfrac34]$ (for some $1 \le i \le j$), then for all $i \le t \le j$ we have $y_t \in [\tfrac14, \tfrac34]$. Hence, we only need to show that by the above algorithm, there exists an index $i$, s.t. $1 \le i \le j$, for which the value of $y_i$ lies in the interval $[\tfrac14, \tfrac34]$. On one hand, assume $y_0 = x \in [x_j, \tfrac14)$. We can then write

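$$\underbrace{t_0 \circ \dots \circ t_0}_{j \text{ times}}(x) = 1-(1-x)^{2^j} \ge 1-(1-x_j)^{2^j} \ge \frac12,$$

where the last step follows from the fact that $x_j = 2^{-j}$. On the other hand, assume $x \in (\tfrac34, (x_j)^{2^{-j}}]$. We can write

$$\underbrace{t_1 \circ \dots \circ t_1}_{j \text{ times}}(x) \le \big((x_j)^{2^{-j}}\big)^{2^j} = x_j \le \frac14.$$

As a result, the above algorithm always succeeds and the lemma is proved for $a = 1 - b = \tfrac14$.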
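The greedy construction used in the proof of (128) can be sketched in a few lines. This is a minimal illustration, assuming the BEC maps $t_0(z) = 1-(1-z)^2$ and $t_1(z) = z^2$ and $x_j = 2^{-j}$ as in this appendix; the name `greedy_path` is hypothetical.

```python
def t0(z: float) -> float:
    """The map z -> 1 - (1 - z)^2."""
    return 1.0 - (1.0 - z) ** 2

def t1(z: float) -> float:
    """The map z -> z^2."""
    return z * z

def greedy_path(x: float, j: int) -> tuple[list[int], float]:
    """Build (b_1, ..., b_j) by the rule in the proof: take t_0 while it
    keeps the value at or below 3/4, otherwise take t_1.
    Returns the chosen bits and the final value y_j."""
    bits, y = [], x
    for _ in range(j):
        if t0(y) <= 0.75:
            bits.append(0)
            y = t0(y)
        else:
            bits.append(1)
            y = t1(y)
    return bits, y
```

Calling `greedy_path(x, j)` for any $x \in [x_j, (x_j)^{2^{-j}}]$ with $j \ge 2$ should return a final value inside $[\tfrac14, \tfrac34]$, which is what the argument above establishes.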

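We now prove the lemma for any choice of $a, b \in (0,1)$ s.t. $\sqrt{a} \le 1 - \sqrt{1-b}$. Let $p_n(z,a,b)$ be defined as in (27). We have

$$\begin{aligned} p_{n+1}(z,a,b) &= \sum_{\omega_{n+1} \in \Omega_{n+1}} \frac{1}{2^{n+1}} \mathbb{1}_{\{z \in \varphi^{-1}_{\omega_{n+1}}[a,b]\}} \\ &= \sum_{\omega_n \in \Omega_n} \frac{1}{2^n}\cdot\frac12\Big(\mathbb{1}_{\{z \in \varphi^{-1}_{\omega_n}[t_0^{-1}(a),\, t_0^{-1}(b)]\}} + \mathbb{1}_{\{z \in \varphi^{-1}_{\omega_n}[t_1^{-1}(a),\, t_1^{-1}(b)]\}}\Big) \\ &= \frac12\Big(p_n\big(z, t_0^{-1}(a), t_0^{-1}(b)\big) + p_n\big(z, t_1^{-1}(a), t_1^{-1}(b)\big)\Big). \end{aligned}$$

It is easy to see that if $\sqrt{a} \le 1 - \sqrt{1-b}$, then

$$[t_0^{-1}(a), t_1^{-1}(b)] \subseteq [t_0^{-1}(a), t_0^{-1}(b)] \cup [t_1^{-1}(a), t_1^{-1}(b)],$$

and hence,

$$2p_{n+1}(z,a,b) \ge p_n\big(z, t_0^{-1}(a), t_1^{-1}(b)\big).$$

Continuing this way, we can show that for $m \in \mathbb{N}$

$$2^m p_{n+m}(z,a,b) \ge p_n\big(z,\ t_0^{-1} \circ \dots \circ t_0^{-1}(a),\ t_1^{-1} \circ \dots \circ t_1^{-1}(b)\big). \tag{129}$$

As $m$ grows large, we have

$$t_0^{-1} \circ \dots \circ t_0^{-1}(a) \to 0, \qquad t_1^{-1} \circ \dots \circ t_1^{-1}(b) \to 1.$$

Therefore, by (129) there exists a positive integer $m_0$ such that for $n \in \mathbb{N}$

$$2^{m_0} p_{n+m_0}(z,a,b) \ge p_n\big(z, \tfrac14, \tfrac34\big).$$

The thesis now follows from this relation together with the result of Lemma 54.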
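The one-step inequality $2p_{n+1}(z,a,b) \ge p_n(z, t_0^{-1}(a), t_1^{-1}(b))$ can be checked by brute force for small $n$. The sketch below assumes that $p_n(z,a,b)$ denotes $\Pr(Z_n \in [a,b] \mid Z_0 = z)$ for the BEC process with $t_0(z) = 1-(1-z)^2$ and $t_1(z) = z^2$; the helper names are illustrative only.

```python
import math
from itertools import product

def t0(z: float) -> float:
    return 1.0 - (1.0 - z) ** 2

def t1(z: float) -> float:
    return z * z

def t0_inv(y: float) -> float:
    """Inverse of t0 on [0, 1]."""
    return 1.0 - math.sqrt(1.0 - y)

def t1_inv(y: float) -> float:
    """Inverse of t1 on [0, 1]."""
    return math.sqrt(y)

def p(n: int, z: float, a: float, b: float) -> float:
    """Pr(Z_n in [a, b] | Z_0 = z), by enumerating all 2^n bit paths."""
    hits = 0
    for bits in product((0, 1), repeat=n):
        y = z
        for bit in bits:
            y = t1(y) if bit else t0(y)
        hits += a <= y <= b
    return hits / 2 ** n
```

With, say, `a, b = 0.2, 0.8` (which satisfies $\sqrt{a} \le 1 - \sqrt{1-b}$), one can compare `2 * p(n + 1, z, a, b)` with `p(n, z, t0_inv(a), t1_inv(b))` for a few starting points `z`. The enumeration is exponential in `n`, so it is only meant for small depths.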

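2) Proof of Lemma 10: Recall that for a realization $\omega = \{b_k\}_{k \in \mathbb{N}} \in \Omega$ we define $\omega_n = (b_1,\dots,b_n)$. The maps $t_0$ and $t_1$, hence the maps $\varphi_{\omega_n}$, are strictly increasing maps on $[0,1]$. Thus $\varphi_{\omega_n}(z) \to 0$ implies that $\varphi_{\omega_n}(z') \to 0$ for $z' \le z$, and $\varphi_{\omega_n}(z) \to 1$ implies that $\varphi_{\omega_n}(z') \to 1$ for $z' \ge z$. Moreover, we know that for almost every $z \in (0,1)$, $\lim_{n\to\infty} \varphi_{\omega_n}(z)$ is either 0 or 1 for almost every realization $\{\varphi_{\omega_n}\}_{n \in \mathbb{N}}$. Hence, it suffices to let

$$z^*_\omega = \inf\{z : \varphi_{\omega_n}(z) \to 1\}.$$

To prove the second part of the lemma, notice that

$$z = \Pr(Z_\infty = 1) = \Pr\big(\varphi_{\omega_n}(z) \to 1\big) = \Pr\big(\inf\{z' : \varphi_{\omega_n}(z') \to 1\} < z\big) = \Pr(z^*_\omega < z),$$

which shows that $z^*_\omega$ is uniformly distributed on $[0,1]$.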

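3) Proof of Lemma 11: In order to compute $\lim_{n\to\infty} E\big[\frac1n \log\big(\varphi^{-1}_{\omega_n}(b) - \varphi^{-1}_{\omega_n}(a)\big)\big]$, we first define the process $\{\hat Z_n\}_{n \in \mathbb{N}\cup\{0\}}$ with $\hat Z_0 = z \in [0,1]$ and

$$\hat Z_{n+1} = \begin{cases} \sqrt{\hat Z_n}, & \text{w.p. } \tfrac12, \\ 1 - \sqrt{1 - \hat Z_n}, & \text{w.p. } \tfrac12. \end{cases} \tag{130}$$

We can think of $\hat Z_n$ as the reverse stochastic process of $Z_n$. Equivalently, we can also define $\hat Z_n$ via the inverse maps $t_0^{-1}, t_1^{-1}$. Consider the sequence of i.i.d. symmetric Bernoulli random variables $B_1, B_2, \dots$ and define $\hat Z_n = \psi_{\omega_n}(z)$ where $\omega_n \triangleq (b_1,\dots,b_n) \in \Omega_n$ and

$$\psi_{\omega_n} \triangleq t_{b_n}^{-1} \circ \dots \circ t_{b_1}^{-1}. \tag{131}$$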



We now show that the Lebesgue measure (or the uniform probability measure) on $[0,1]$, denoted by $\nu$, is the unique, hence ergodic, invariant measure for the Markov process $\hat Z_n$. To prove this result, first note that if $\hat Z_n$ is distributed according to the Lebesgue measure, then

$$\Pr(\hat Z_{n+1} < x) = \tfrac12\Pr\big(\hat Z_n < t_1(x)\big) + \tfrac12\Pr\big(\hat Z_n < t_0(x)\big) = \tfrac12 x^2 + \tfrac12(2x - x^2) = x.$$

Thus, $\hat Z_{n+1}$ is also distributed according to the Lebesgue measure and this implies the invariance of the Lebesgue measure for $\hat Z_n$. In order to prove the uniqueness, we will show that for any $z \in (0,1)$, $\hat Z_n$ converges weakly to a uniformly distributed random point in $[0,1]$, i.e.,

$$\hat Z_n = \psi_{\omega_n}(z) \xrightarrow{d} \nu. \tag{132}$$

Note that with (132) the uniqueness of $\nu$ is proved since for any invariant measure $\rho$, assuming $\hat Z_0$ is distributed according to $\rho$, we have

$$\rho(\cdot) = \Pr(\hat Z_n \in \cdot) = \int \Pr(\hat Z_n \in \cdot \mid \hat Z_0 = z)\,\rho(dz) \xrightarrow{d} \nu(\cdot). \tag{133}$$
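The invariance computation above can be illustrated numerically. The following sketch pushes a uniform midpoint grid through the two inverse maps $\sqrt{u}$ and $1-\sqrt{1-u}$, each with weight $\tfrac12$, and evaluates the resulting distribution function; the function name and the grid size are arbitrary choices made for this illustration.

```python
import math

def pushforward_cdf(x: float, n_points: int = 100_000) -> float:
    """Approximate Pr(Z_next < x) when Z is uniform on [0, 1] (midpoint grid)
    and Z_next is sqrt(Z) or 1 - sqrt(1 - Z), each with probability 1/2."""
    count = 0
    for i in range(n_points):
        u = (i + 0.5) / n_points
        count += math.sqrt(u) < x                 # branch sqrt(Z)
        count += 1.0 - math.sqrt(1.0 - u) < x     # branch 1 - sqrt(1 - Z)
    return count / (2 * n_points)
```

If the uniform law is invariant, `pushforward_cdf(x)` should be close to `x` itself for every `x` in $[0,1]$, up to the grid resolution.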

To prove (132), note that $\psi_{\omega_n}$ has the same (probability) law as $\varphi^{-1}_{\omega_n}$ and we know that $\varphi^{-1}_{\omega_n}(z) \to z^*_\omega$ almost surely and hence weakly. Also, $z^*_\omega$ is distributed according to $\nu$, which proves (132). We are now ready to show that



where $c_1$ is a positive constant that depends on $\alpha, \beta, \rho$.

Proof: The proof consists of three steps. First, consider an arbitrary BMS channel $W$ and let $Z_n = Z(W_n)$. Also, consider the process $E_n = 1 - Z_n^2$. By using the relations (12) and (13), it can easily be checked that the process $E_n$ has the form of (140) and hence Lemma 20 is applicable to $E_n$. We thus have from (141) that for $n \in \mathbb{N}$

$$\Pr\big(E_n \ge \tfrac12\big) \le c_2 E_0\Big(1 + \log\frac{1}{E_0}\Big).$$

As a consequence

$$I(W) = \lim_{n\to\infty}\Pr\big(E_n \ge \tfrac12\big) \le c_2\big(1 - Z(W)^2\big)\Big(1 + \log\frac{1}{1 - Z(W)^2}\Big). \tag{138}$$



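In the second step, we consider a channel $W$ for which (136) holds for $n \in \mathbb{N}$. By using (136), it is easy to see that for $n \in \mathbb{N}$

$$\begin{aligned} E\big[(Z_n^2(1-Z_n^2))^\alpha\, \mathbb{1}_{\{Z_n \ge \frac12\}}\big] &= E\big[(Z_n(1+Z_n))^\alpha (Z_n(1-Z_n))^\alpha\, \mathbb{1}_{\{Z_n \ge \frac12\}}\big] \\ &\le \sup_{z \in [\frac12, 1]} (z(1+z))^\alpha\; E\big[(Z_n(1-Z_n))^\alpha\, \mathbb{1}_{\{Z_n \ge \frac12\}}\big] \end{aligned}$$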

$$\lim_{n\to\infty} E\Big[\frac1n \log\big(\varphi^{-1}_{\omega_n}(b) - \varphi^{-1}_{\omega_n}(a)\big)\Big] = \frac{1}{2\ln 2} - 1. \tag{134}$$






(139) 



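Using the mean value theorem, we can write

$$\psi_{\omega_n}(b) - \psi_{\omega_n}(a) = \psi'_{\omega_n}(c)(b-a),$$

for some $c \in (a,b)$. And by the chain rule,

$$\begin{aligned} \psi'_{\omega_n}(c) &= \big(t_{b_n}^{-1} \circ t_{b_{n-1}}^{-1} \circ \dots \circ t_{b_1}^{-1}\big)'(c) \\ &= (t_{b_1}^{-1})'(c) \cdot (t_{b_2}^{-1})'\big(t_{b_1}^{-1}(c)\big) \cdots (t_{b_n}^{-1})'\big(t_{b_{n-1}}^{-1} \circ \dots \circ t_{b_1}^{-1}(c)\big) \\ &= (t_{b_1}^{-1})'\big(\psi_0(c)\big) \cdot (t_{b_2}^{-1})'\big(\psi_1(c)\big) \cdots (t_{b_n}^{-1})'\big(\psi_{n-1}(c)\big), \end{aligned}$$

and after applying $\log(\cdot)$ to both sides we obtain

$$\frac1n \log\big(\psi'_{\omega_n}(c)\big) = \frac1n \sum_{i=1}^{n} \log (t_{b_i}^{-1})'\big(\psi_{i-1}(c)\big). \tag{135}$$

By the ergodic theorem, the last expression converges almost surely to the expectation of $\log (t_{B_1}^{-1})'(U)$, where $U$ is assumed to be distributed according to $\nu$. Hence, the asymptotic value of (135) can be computed as

$$E\big[\log (t_{B_1}^{-1})'(U)\big] = \frac12\int_0^1 \log\big(\sqrt{x}\big)'\,dx + \frac12\int_0^1 \log\big(1 - \sqrt{1-x}\big)'\,dx = \frac{1}{2\ln 2} - 1.$$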
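The value of the expectation above can be cross-checked with a simple quadrature. This is a rough sketch that assumes base-2 logarithms and uses the derivatives $(\sqrt{x})' = \frac{1}{2\sqrt{x}}$ and $(1-\sqrt{1-x})' = \frac{1}{2\sqrt{1-x}}$; the midpoint rule is an illustrative choice that copes with the integrable singularities at the endpoints.

```python
import math

def expected_log_derivative(n_points: int = 200_000) -> float:
    """Midpoint-rule estimate of E[log2 (t_B^{-1})'(U)] for U uniform on (0, 1)
    and B a fair bit selecting sqrt(x) or 1 - sqrt(1 - x)."""
    total = 0.0
    for i in range(n_points):
        x = (i + 0.5) / n_points
        d_sqrt = 1.0 / (2.0 * math.sqrt(x))          # derivative of sqrt(x)
        d_other = 1.0 / (2.0 * math.sqrt(1.0 - x))   # derivative of 1 - sqrt(1 - x)
        total += 0.5 * math.log2(d_sqrt) + 0.5 * math.log2(d_other)
    return total / n_points

# Closed form from the text: 1 / (2 ln 2) - 1, roughly -0.2787.
closed_form = 1.0 / (2.0 * math.log(2.0)) - 1.0
```

The estimate returned by `expected_log_derivative()` should agree with `closed_form` to several decimal places.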

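Appendix B
Auxiliary Lemmas

Lemma 19: Consider a channel $W$ with its Bhattacharyya process $Z_n = Z(W_n)$ and assume that for $n \in \mathbb{N}$

$$E\big[(Z_n(1-Z_n))^\alpha\big] \le \beta 2^{-n\rho}, \tag{136}$$

where $\alpha, \beta, \rho$ are positive constants with $\alpha < 1$. We then have for $n \in \mathbb{N}$

$$\Pr\big(Z_n \le \tfrac12\big) \ge I(W) - c_1 2^{-n\rho}. \tag{137}$$

In the final step, we consider a number $n \in \mathbb{N}$ and let $N = 2^n$. We then define the set $A$ as

$$A = \big\{i : Z(W_N^{(i)}) \le \tfrac12\big\},$$

with $A^c$ being its complement. We have

$$\begin{aligned} \sum_{i \in A^c} I(W_N^{(i)}) &\overset{(a)}{\le} \sum_{i \in A^c} c_2\big(1 - Z(W_N^{(i)})^2\big)\Big(1 + \log\frac{1}{1 - Z(W_N^{(i)})^2}\Big) \\ &\overset{(b)}{\le} \sum_{i \in A^c} 4c_2c_3\Big(Z(W_N^{(i)})^2\big(1 - Z(W_N^{(i)})^2\big)\Big)^\alpha \\ &= 4c_2c_3 N\, E\big[(Z_n^2(1-Z_n^2))^\alpha\, \mathbb{1}_{\{Z_n > \frac12\}}\big] \\ &\overset{(c)}{\le} 8c_2c_3 N \beta 2^{-n\rho}. \end{aligned}$$

Here (a) follows from (138), (b) follows from Lemma 21 and the fact that for $x \le \tfrac34$ we have $1 + \log\frac1x \le 4\log\frac1x$, and (c) follows from (139). Now, as a consequence of the above chain of inequalities we have

$$|A| \ge \sum_{i \in A} I(W_N^{(i)}) = N I(W) - \sum_{i \in A^c} I(W_N^{(i)}) \ge N\big(I(W) - 8c_2c_3\beta 2^{-n\rho}\big),$$

and consequently

$$\Pr\big(Z_n \le \tfrac12\big) = \frac{|A|}{N} \ge I(W) - 8c_2c_3\beta 2^{-n\rho}.$$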



Hence, the proof follows.

Lemma 20: Consider a generic stochastic process $\{X_n\}_{n \ge 0}$ s.t. $X_0 = x$, where $x \in (0,1)$ and for $n \ge 1$

$$X_n \le \begin{cases} X_{n-1}^2, & \text{if } B_n = 1, \\ 2X_{n-1}, & \text{if } B_n = 0. \end{cases} \tag{140}$$

Here, $\{B_n\}_{n \ge 1}$ is a sequence of iid random variables with distribution Bernoulli($\tfrac12$). We then have for $n \in \mathbb{N}$

$$\Pr\Big(X_n \le 2^{-2^{\sum_{i=1}^n B_i}}\Big) \ge 1 - c_2 x\Big(1 + \log\frac1x\Big), \tag{141}$$

where $c_2$ is a positive constant.

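Proof: We analyze the process $A_n = -\log X_n$, i.e., $A_0 = -\log x \triangleq a_0$ and

$$A_n \ge \begin{cases} 2A_{n-1}, & \text{if } B_n = 1, \\ A_{n-1} - 1, & \text{if } B_n = 0. \end{cases} \tag{142}$$

Note that in terms of the process $A_n$, the statement of the lemma can be phrased as

$$\Pr\Big(A_n \ge 2^{\sum_{i=1}^n B_i}\Big) \ge 1 - c_2\frac{1 + a_0}{2^{a_0}}.$$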

Associate to each $(b_1,\dots,b_n) \triangleq \omega_n \in \Omega_n$ a sequence of "runs" $(r_1,\dots,r_{k(\omega_n)})$. This sequence is constructed by the following procedure. We define $r_1$ as the smallest index $i \in \mathbb{N}$ so that $b_{i+1} \ne b_i$. In general, if $\sum_{j=1}^{k-1} r_j < n$ then

$$r_k = \min\Big\{i \,\Big|\, \sum_{j=1}^{k-1} r_j < i \le n,\ b_{i+1} \ne b_i\Big\} - \sum_{j=1}^{k-1} r_j.$$

The process stops whenever the sum of the runs equals $n$. Denote the stopping time of the process by $k(\omega_n)$. In words, the sequence $(b_1,\dots,b_n)$ starts with $b_1$. It then repeats $b_1$, $r_1$ times. Next follow $r_2$ instances of $\bar b_1$ ($\bar b_1 := 1 - b_1$), followed again by $r_3$ instances of $b_1$, and so on. We see that $b_1$ and $(r_1,\dots,r_{k(\omega_n)})$ fully describe $\omega_n = (b_1,\dots,b_n)$. Therefore, there is a one-to-one map

$$(b_1,\dots,b_n) \longleftrightarrow \{b_1, (r_1,\dots,r_{k(\omega_n)})\}. \tag{143}$$
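The one-to-one map (143) is a run-length encoding of the bit sequence. A minimal sketch of both directions follows; the function names `to_runs` and `from_runs` are hypothetical and not part of the paper.

```python
def to_runs(bits: list[int]) -> tuple[int, list[int]]:
    """Map (b_1, ..., b_n) to (b_1, [r_1, ..., r_k]): first bit plus run lengths."""
    if not bits:
        raise ValueError("need at least one bit")
    runs, length = [], 1
    for prev, cur in zip(bits, bits[1:]):
        if cur == prev:
            length += 1
        else:
            runs.append(length)
            length = 1
    runs.append(length)
    return bits[0], runs

def from_runs(first_bit: int, runs: list[int]) -> list[int]:
    """Inverse map: rebuild the bit sequence by alternating the bit run by run."""
    bits, bit = [], first_bit
    for r in runs:
        bits.extend([bit] * r)
        bit = 1 - bit
    return bits
```

For example, `to_runs([1, 0, 0, 1, 0, 0, 0, 1])` gives the first bit `1` together with the run lengths `[1, 2, 1, 3, 1]`, and `from_runs` recovers the original sequence.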



Note that we can either have $b_1 = 1$ or $b_1 = 0$. We start with the first case, i.e., we first assume $B_1 = 1$. We have:

$$\sum_{j \text{ odd} \le k(\omega_n)} r_j = \sum_{i=1}^{n} b_i \qquad \text{and} \qquad \sum_{j=1}^{k(\omega_n)} r_j = n.$$

Analogously, for a realization $(b_1, b_2, \dots) \triangleq \omega \in \Omega$ of the infinite sequence of random variables $\{B_i\}_{i \in \mathbb{N}}$, we can associate a sequence of runs $(r_1, r_2, \dots)$. In this regard, considering the infinite sequence of random variables $\{B_i\}_{i \in \mathbb{N}}$ (with the extra condition $B_1 = 1$), the corresponding sequence of runs, which we denote by $\{R_k\}_{k \in \mathbb{N}}$, is an iid sequence with $\Pr(R_i = j) = \frac{1}{2^j}$. Let us now see how we can express $A_n$ in terms of $r_1, r_2, \dots, r_{k(\omega_n)}$. We begin with a simple example: Consider the sequence $(b_1 = 1, b_2, \dots, b_8)$ and the associated run sequence $(r_1,\dots,r_5) = (1,2,1,3,1)$. We have

$$\begin{aligned} A_3 &= a_0 2^{r_1} - r_2, \\ A_4 &= (a_0 2^{r_1} - r_2)2^{r_3} = a_0 2^{r_1+r_3} - r_2 2^{r_3}, \\ A_7 &= (a_0 2^{r_1} - r_2)2^{r_3} - r_4 = a_0 2^{r_1+r_3} - r_2 2^{r_3} - r_4, \\ A_8 &= \big((a_0 2^{r_1} - r_2)2^{r_3} - r_4\big)2^{r_5} \\ &= a_0 2^{r_1+r_3+r_5} - r_2 2^{r_3+r_5} - r_4 2^{r_5} \\ &= 2^{r_1+r_3+r_5}\big(a_0 - 2^{-r_1}r_2 - 2^{-(r_1+r_3)}r_4\big). \end{aligned}$$

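In general, for a sequence $(b_1,\dots,b_n)$ with the associated run sequence $(r_1,\dots,r_{k(\omega_n)})$ we can write:

$$\begin{aligned} A_n &= a_0 2^{\sum_{i \text{ odd} \le k(\omega_n)} r_i} - \sum_{i \text{ even} \le k(\omega_n)} r_i 2^{\sum_{i < j \text{ odd} \le k(\omega_n)} r_j} \\ &= a_0 2^{\sum_{i \text{ odd} \le k(\omega_n)} r_i} - \sum_{i \text{ even} \le k(\omega_n)} r_i 2^{\left(-\sum_{j \text{ odd} < i} r_j + \sum_{j \text{ odd} \le k(\omega_n)} r_j\right)} \\ &= \Big[2^{\sum_{i \text{ odd} \le k(\omega_n)} r_i}\Big]\Big[a_0 - \Big(\sum_{i \text{ even} \le k(\omega_n)} r_i 2^{-\sum_{j \text{ odd} < i} r_j}\Big)\Big] \\ &= \Big[2^{\sum_{i=1}^{n} b_i}\Big]\Big[a_0 - \Big(\sum_{i \text{ even} \le k(\omega_n)} r_i 2^{-\sum_{j \text{ odd} < i} r_j}\Big)\Big]. \end{aligned}$$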
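The run-based expression for $A_n$ can be compared against the recursion directly. The sketch below treats the equality case of (142) (doubling on a 1, subtracting one on a 0) and assumes $b_1 = 1$, so that odd-indexed runs consist of ones; exact rational arithmetic avoids rounding issues. The function names are illustrative.

```python
from fractions import Fraction
from itertools import groupby

def a_n_recursive(bits, a0):
    """Equality case of (142): double on a 1, subtract one on a 0."""
    a = Fraction(a0)
    for b in bits:
        a = 2 * a if b == 1 else a - 1
    return a

def a_n_closed_form(bits, a0):
    """Run-based expression for A_n, valid when b_1 = 1."""
    runs = [len(list(group)) for _, group in groupby(bits)]
    odd_total = sum(runs[0::2])      # r_1 + r_3 + ... (1-indexed odd runs)
    penalty = Fraction(0)
    odd_before = 0                   # sum of odd-indexed runs seen so far
    for idx, r in enumerate(runs, start=1):
        if idx % 2 == 1:
            odd_before += r
        else:
            penalty += Fraction(r, 2 ** odd_before)
    return 2 ** odd_total * (Fraction(a0) - penalty)
```

For the example sequence with runs $(1,2,1,3,1)$ and $a_0 = 5$, both functions give $2^{3}(5 - 1 - \tfrac34) = 26$.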

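Our aim is to lower-bound

$$\Pr\Big(A_n \ge 2^{\sum_{i=1}^n b_i}\Big) = \Pr\Big(a_0 - \sum_{i \text{ even} \le k(\omega_n)} r_i 2^{-\sum_{j \text{ odd} < i} r_j} \ge 1\Big),$$

or, equivalently, to upper-bound

$$\Pr\Big(\sum_{i \text{ even} \le k(\omega_n)} r_i 2^{-\sum_{j \text{ odd} < i} r_j} > a_0 - 1\Big). \tag{144}$$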

For $n \in \mathbb{N}$, define the set $U_n \subseteq \Omega_n$ as

$$U_n = \Big\{\omega_n \in \Omega_n \,\Big|\, \exists\, l \le k(\omega_n) : \sum_{i \text{ even} \le l} r_i 2^{-\sum_{j \text{ odd} < i} r_j} > a_0 - 1\Big\}.$$

Clearly we have:

$$\Pr\Big(\sum_{i \text{ even} \le k(\omega_n)} r_i 2^{-\sum_{j \text{ odd} < i} r_j} > a_0 - 1\Big) \le \Pr(U_n).$$

In the following we show that if $(b_1,\dots,b_n) \in U_n$, then for any choice of $b_{n+1}$, $(b_1,\dots,b_n,b_{n+1}) \in U_{n+1}$. We will only consider the case when $b_n = b_{n+1} = 1$; the other three cases can be verified similarly. Let $\omega_n = (b_1,\dots,b_{n-1},b_n = 1) \in U_n$. Hence, $k(\omega_n)$ is an odd number (recall that $b_1 = 1$) and the quantity $\sum_{i \text{ even} \le k(\omega_n)} r_i 2^{-\sum_{j \text{ odd} < i} r_j}$ does not depend on $r_{k(\omega_n)}$. Now consider the sequence $\omega_{n+1} = (b_1,\dots,b_n = 1, 1)$. Since the last bit ($b_{n+1}$) equals 1, then $r_{k(\omega_{n+1})} = r_{k(\omega_n)} + 1$ and the value of the sum remains unchanged. As a result $(b_1,\dots,b_n,1) \in U_{n+1}$. From above, we conclude that $\theta_i(U_i) \subseteq \theta_{i+1}(U_{i+1})$ and as a result

$$\Pr(U_i) = \Pr(\theta_i(U_i)) \le \Pr(\theta_{i+1}(U_{i+1})) = \Pr(U_{i+1}).$$

Hence, the quantity $\lim_{n\to\infty}\Pr(U_n) = \lim_{n\to\infty}\Pr(\theta_n(U_n)) = \lim_{n\to\infty}\Pr\big(\cup_{i=1}^{n}\theta_i(U_i)\big)$ is an upper bound on (144). On the other hand, consider the set

$$V = \Big\{\omega \in \Omega \,\Big|\, \exists\, l : \sum_{i \text{ even} \le l} r_i 2^{-\sum_{j \text{ odd} < i} r_j} > a_0 - 1\Big\}.$$

By the definition of $V$ we have $\cup_{i=1}^{\infty}\theta_i(U_i) \subseteq V$, and as a result, $\Pr\big(\cup_{i=1}^{\infty}\theta_i(U_i)\big) \le \Pr(V)$. In order to bound the probability of the set $V$, note that assuming $B_1 = 1$, the sequence $\{R_k\}_{k \in \mathbb{N}}$ (i.e., the sequence of runs when associated with the sequence $\{B_i\}_{i \in \mathbb{N}}$) is an iid sequence with $\Pr(R_i = j) = \frac{1}{2^j}$. We also have

Pr(ao- Y, R,2-^^"<"'<^^^ <1) (145) 

i even < m 

= Pr( ^ R^2-^^<""'<^"^ >ao-l) 



n 

Y, Pi-(^n > 2^-1 ^* I -Ri = *, Bi = 0)Pr(i?i =i\Bi=0) 

1=1 

X; Pr(^n > 2^-1 ''* I -Ri = «, Si = 0)Pr(i?i = i I Bi = 0) 



i<an-l.'i<n 



z even < m 



i?-2~ ^J odd < i ^j - 



C[2S. 



'] 



X; Pr(i?i=i|Si=0) 

i>aQ-l,i<n 

y 1 3 ^2 

^ «ao^...„ 2' 2-0-1- + 2ao-l 

. 3ao 



2"o-i 

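where the last step follows from the Markov inequality. The idea is now to provide an upper bound on the quantity $E\big[2^{\sum_{i \text{ even} \le m} R_i 2^{-\sum_{j \text{ odd} < i} R_j}}\big]$. Let $X = \sum_{i \text{ even} \le m} R_i 2^{-\sum_{j \text{ odd} < i} R_j}$. We have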

oo 

= ^Pr(i?2 = 0E[2^|fi2 = ?] 
1=1 

1=1 ^ 

= ^-E[2^]E[2F] 
1=1 ^ 
°° 1 y 

k2\2^-^) ^ ^ 

i=i2'(2^"F) 

where (a) follows from the fact that the $R_i$s are iid and $X$ is self-similar and (b) follows from Jensen's inequality. As a result, an upper bound on the quantity $E[2^X]$ can be derived as follows. We have



2"o-i 
Hence, considering the two cases together, we have: 

3(1 + ao) 



Hence, the proof follows with C2 = 3. 



2^0 



Lemma 21: Let $\alpha < 1$ be a constant. We have for $x \in (0, \tfrac34]$

$$x\log\frac{1}{x} \le c_3\big(x(1-x)\big)^{\alpha},$$

where

C3 



(146) 



(147) 



(1-Q:)ln2 

Proof: By applying the function $\log(\cdot)$ to both sides of (146) and some further simplifications, the inequality (146) is equivalent to the following: For $x \in (0, \tfrac34]$

$$\log\Big(\log\frac1x\Big) \le \log c_3 + (1-\alpha)\log\frac1x + \alpha\log(1-x).$$

As $x \le \tfrac34$, we have $\alpha\log(1-x) \ge -\log 4$. Hence, in order for the above inequality to hold it is sufficient that for $x \in (0, \tfrac34]$

$$\log\Big(\log\frac1x\Big) \le \log\frac{c_3}{4} + (1-\alpha)\log\frac1x.$$


E[2-^]< 



1 .^r,X,^l 1 



Now, by letting $u = \log\frac1x$, the last inequality becomes



2(2* -1) 



(E[2'^])3 + 



4(23-1) 



(E[2-^]) 



$$(1-\alpha)u - \log u + \log\frac{c_3}{4} \ge 0, \tag{148}$$



1 



4(25-1) 



(E[2-^]) 



The equation y 



i 1 i 1 i U 

(2 H ^ — yi H ^ — V^ has 

2(2t-l)" 4(21-1) 4(21?-1) 



-T — -y2 



only one real valued solution $y^*$, and $y^* \le 3$ (more precisely, $y^* \approx 2.87$). As a result, we have $E[2^X] \le y^* \le 3$. Thus by (145) we obtain

Pr(ao- Y i?.2-S.o.a<.«.<i)< 3 



for $u \ge \log(\tfrac43)$. It is now easy to check that by the choice of $c_3$ as in (147), the minimum of the above expression over the range $u \ge \log(\tfrac43)$ is always non-negative and hence the proof follows. ∎
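The reduction to (148) can be illustrated numerically. The sketch below assumes base-2 logarithms and $0 < \alpha < 1$. The constant it uses is not the $c_3$ of (147): it is the smallest value for which the left side of (148) stays non-negative for all $u > 0$, obtained here by minimising $(1-\alpha)u - \log_2 u$ at $u = \frac{1}{(1-\alpha)\ln 2}$. All names are hypothetical.

```python
import math

def sufficient_c3(alpha: float) -> float:
    """A constant making (148) non-negative for every u > 0 (logs base 2).
    Derived for this sketch by minimising (1 - alpha) * u - log2(u);
    it is an assumption of the example, not the constant in (147)."""
    return 4.0 / (math.e * (1.0 - alpha) * math.log(2.0))

def expression_148(u: float, alpha: float, c3: float) -> float:
    """Left-hand side of (148): (1 - alpha) u - log2(u) + log2(c3 / 4)."""
    return (1.0 - alpha) * u - math.log2(u) + math.log2(c3 / 4.0)

def inequality_146_holds(x: float, alpha: float, c3: float) -> bool:
    """Check x * log2(1/x) <= c3 * (x * (1 - x))^alpha."""
    return x * math.log2(1.0 / x) <= c3 * (x * (1.0 - x)) ** alpha
```

With `c3 = sufficient_c3(alpha)`, the inequality (146) should hold on a grid of points in $(0, \tfrac34]$, in line with the sufficiency argument of the proof.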



t even < m 

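Thus, given that $B_1 = 1$, we have:

$$\Pr\Big(A_n \ge 2^{\sum_{i=1}^n B_i}\Big) \ge 1 - \frac{3}{2^{a_0-1}}.$$

Or more precisely we have

$$\Pr\Big(A_n \ge 2^{\sum_{i=1}^n B_i} \,\Big|\, B_1 = 1\Big) \ge 1 - \frac{3}{2^{a_0-1}}.$$

Now consider the case $B_1 = 0$. We show that a similar bound applies for $A_n$. Firstly, note that by fixing the value of $n$ the distribution of $R_1$ is as follows: $\Pr(R_1 = i) = \frac{1}{2^i}$ for $1 \le i \le n-1$ and $\Pr(R_1 = n) = \frac{1}{2^{n-1}}$. We have

$$\Pr\Big(A_n \ge 2^{\sum_{i=1}^n B_i} \,\Big|\, B_1 = 0\Big)$$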

